Abstract: A large class of computational problems is characterized by frequent synchronization and by
computational requirements which change as a function of time. When such a problem must be solved
on a message passing multiprocessor machine, the combination of these characteristics leads to system
performance which decreases in time. Performance can be improved with periodic redistribution of
computational load; however, redistribution can exact a sometimes large delay cost. We study the
issue of deciding when to invoke a global load remapping mechanism. Such a decision policy must
effectively weigh the costs of remapping against the performance benefits. We treat this problem by
constructing two analytic models which exhibit stochastically decreasing performance. One model is
quite tractable; we are able to describe the optimal remapping algorithm, and the optimal decision
policy governing when to invoke that algorithm. However, computational complexity prohibits the use of
the optimal remapping decision policy. We then study the performance of a general remapping policy
on both analytic models. This policy attempts to minimize a statistic W(n) which measures the system
degradation (including the cost of remapping) per computation step over a period of n steps. We show
that as a function of time, the expected value of W(n) has at most one minimum, and that when this
minimum exists it defines the optimal fixed-interval remapping policy. Our decision policy appeals to
this result by remapping when it estimates that W(n) is minimized. Our performance data suggest that
this policy effectively finds the natural frequency of remapping. We also use the analytic models to
express the relationship between performance and remapping cost, number of processors, and the
computation’s stochastic activity.

1. Introduction


Many computational problems assume a discrete model of a physical system, and calculate a set 
of values for every domain point in the model. These values are often functions of time, so that it is 
intuitive to think of the computation as marching through time. When such a problem is mapped onto a 
message passing multiprocessor machine or a shared memory machine with fast local memories, 
regions of the model domain are assigned to each processor. The running behavior of such a system is 
often characterized as a sequence of steps, or iterations. During a step, a processor computes the 
appropriate values for its domain points. At the step’s end, it communicates any newly computed 
results required by other processors. Finally, it waits for other processors to complete their computa- 
tion step, and send it data required for the next step’s computation. 

The computational work associated with each portion of a problem’s subdomain may change over 
the course of solving the problem. This may be true because the behavior of the modeled physical 
system may change with time. The distribution of computational work over a domain may also change 
in problems without explicit time dependence. For example, during the course of solving a problem, 
more work may be required to resolve features of the emerging solution. Since time stepping is often 
used as a means for obtaining a steady state solution, there is considerable overlap between the above 
mentioned categories. We call these types of problems varying demand distribution problems. 
Because of the synchronization between steps, the system execution time during a step is effectively 
determined by the execution time of the slowest, or most heavily loaded processor. We can then 
expect system performance to deteriorate in time, as the changing resource demand causes some pro- 
cessor to become proportionally overloaded. One way of dealing with this problem is to periodically 
redistribute, or remap load among processors. 

Changing distributions of computational work over a domain arise through the use of adaptive
methods in the solution of hyperbolic partial differential equations. These methods place extra grid
points in some regions of the problem domain in order to resolve all features of the solution to the
same accuracy [7], [8], [20], [33], [24], [45]. A number of studies have investigated methods for
redistributing load in message passing multiprocessors for this type of problem [6], [21], [22]. Changing
distributions of computational work can also occur when vortex methods are applied to the numerical
simulation of incompressible flow fields. In these methods, inviscid fluid dynamics is modeled by parcels of
vorticity which induce motion in one another [1], [27]. The number of vortices corresponding to a
given region in the domain varies during the course of the solution of a problem. Methods for
dynamically redistributing work in this problem have been investigated [2].

In multirate methods for the solution of systems of ordinary differential equations [46], different 
variables in the system of equations are stepped forward with different timesteps. The size of the 
timesteps in the system is generally equal to that of a globally defined largest timestep divided by an 
integer. The size of the different timesteps utilized may vary during the course of solving the problem, 
and hence the computational work associated with the integration of a given set of variables may 
change. Another class of problems which may have varying resource demands consists of adaptive methods
for solving elliptic partial differential equations, where iterations are carried out on a sequence of
adaptively defined meshes [4], [28], [10], [3], [47]. Generally both the total amount of computational work
required by each of the meshes and the distribution of work within the domain change as one moves
from one mesh to the next. There are also non-numerical parallel computations which can exhibit
varying computational requirements. In time driven discrete event simulations [18], one simulates the 
interactions over time of a set of objects. Responsibility for a subset of objects is assigned to each 
processor. Over the course of the simulation, subsets of objects may differ in activity, and hence in 
their computational requirements. This problem may also arise in parallel simulations which proceed in 
a loosely synchronized manner, such as those described in [11], [13], [26], [32], [34], [30].

There are two fundamentally different approaches to such remapping. The decentralized load
balancing approach is usually studied in the context of a queueing network [17], [19], [29], [40], [41],
[43], [44]. Load balancing questions are then focused on "transfer policies" and "location policies" [17].
A transfer policy governs whether a job arriving at a service center is kept or is routed elsewhere for
processing. A location policy determines which service center receives a transferred job.
Decentralized balancing seems to be the natural approach when jobs are independent, and a global view of
balancing would not yield substantially better load distributions.

However, a large class of computations is not well characterized by a job arrival model, and it 
may be advantageous to take a global, or centralized perspective when balancing. We will call a global 
balancing mechanism "mapping" to distinguish it from the localized connotations of the term load 
balancing. A centralized mapping mechanism can exploit full knowledge of the computation and its 
behavior. Furthermore, dependencies between different parts of a computation can be complex, mak- 
ing it difficult to dynamically move small pieces of the computation from processor to processor in a 
decentralized way. Global mapping is natural in a computational environment where other decisions 
are already made globally, e.g. convergence checking in an iterative numerical method. Yet the execu- 
tion of a global mapping algorithm may be costly, as may the subsequent implementation of the new 
workload distribution. A number of authors have considered global mapping policies under varying
model assumptions; for example, see [12], [16], [23], [42], [5], [9]. A comparison between global
and decentralized mapping strategies is reported in [25].

For the types of problems we describe, remapping the load with a global mechanism is
tantamount to repartitioning the set of model domain points into regions, and assigning the newly defined
regions to processors. A mapping algorithm of this sort is studied in [6], and the performance of this
mapping algorithm in the context of vortex methods is investigated in [2]. Decision policies
determining when a load should be remapped become quite important. The overhead associated with remapping
can be high, so it is important to balance the overhead cost of remapping with the expected
performance gain achieved by remapping. While this is a generic problem, the details of load evolution, of
the remapping mechanism, and of various overhead costs are system and computation dependent. In
order to study general properties of remapping decision policies, it is necessary to model the behavior
of interest, and evaluate the performance of decision policies on those models. Remapping is treated
this way in [31] under the assumption that the parallel computation has multi-phase behavior. In the
present paper we consider remapping of varying demand distribution problems using two different
stochastic models. An overview by the present authors introducing some of the ideas developed in this
paper is presented in [37].

The evaluation of policies for memory management in multiprogrammed uniprocessor systems 
has successfully employed a number of stochastic models to reflect the memory requirements of typical 
programs [15], [39]. In these models, the principle of memory reference locality plays a central role.
Evaluating policies for scheduling a remapping in message passing machines is somewhat similar in
spirit to the evaluation of paging algorithms in multiprogrammed uniprocessor systems. The principle
of locality that we attempt to capture here is the locality of resource demand. The computational work
corresponding to the problem region assigned to a given processor will often vary in a gradual fashion.
In this paper we consider two models which describe this evolution probabilistically. The first model
assumes that the computational requirements of each partition region behave as a Markov chain,
independently of any other region; this is called the Multiple Markov chain (MUM) model. The MUM 
model has the advantage of being analytically tractable in several ways. However, for many problems 
it may not be reasonable to assume independence in load evolution between partition regions. We 
address this issue with a second, less tractable model, the Load Dependency (LD) model. These 
models attempt to capture the dynamics by which the distribution of computational load changes in 
time, and are characterized by a small number of important parameters. Through the use of these load 
evolution models, we are able to evaluate policies for deciding when load should be remapped.

In this paper we propose a policy which attempts to minimize the statistic W(n) measuring the
average (over n steps) system degradation per step (including the cost of a remapping). We show that
as a function of n, the expected value of W(n) has at most one minimum. If the minimum occurs at n,
then we show that the optimal fixed-interval remapping policy is to remap every n steps. These results
support the general philosophy of the heuristic. Empirical studies based on our models show that the
heuristic is effective on different models of load behavior. We also describe analytical work which
looks at the relative effects of the model parameters on remapping frequency.

This paper is organized as follows. Section 2 describes the Multiple Markov chain model of com- 
putational variation, and shows how it captures the drifting computational load phenomenon. Section 3 
discusses optimal algorithms both for determining how to remap, and for determining dynamically
when to remap. The high computational expense of computing the optimal decision policy leads us to 
define a simple inexpensive heuristic policy in section 4, where we also discuss the heuristic’s perfor- 
mance. Section 5 then presents an analysis of the statistic used by the heuristic, and shows analytically 
that the heuristic is well motivated. Section 6 presents an alternate model of load variation, and shows 
how the statistic of section 4 is also effective with this model. Section 7 summarizes our results, and 
the appendices treat technical issues in detail. 

2. Multiple Markov Chain Model 

A processor is assigned a region of the problem domain. At each step, the processor needs to 
perform a certain amount of computation related to that region. Upon completion of this computation, 
it may send messages to other processors, reporting newly calculated results. Before advancing to the 
next step, the processor then synchronizes as required by the computation. The computational require- 
ments of the region may vary gradually from step to step. The MUM model characterizes this varia- 
tion by modeling the work demands on each processor using a Markovian birth-death process. The 
state s of the chain is a positive integer describing the execution time of the processor at a step. We
also assume that s ≤ L for some L. The transition probabilities out of s reflect the principle of locality,
where all one step transitions are to neighboring states. When s is between 2 and L-1, the probability
that the chain will make a one step transition to state s+1 is p/2, the probability of a one step transition
to s-1 is p/2, and the probability that the chain will not make a transition is 1-p. For s = 1 or s = L,
the state remains the same with probability 1 - p/2, and moves to the single neighboring state with
probability p/2.
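The transition rule just described can be sketched in a few lines of Python. This is only an illustration of a single chain under the stated probabilities; the function name `step_chain` and the use of Python's `random.Random` generator are assumptions made for the example, not part of the model definition.

```python
import random

def step_chain(s, p, L, rng):
    """Advance one MUM chain by one step (illustrative helper, hypothetical name).

    States are the integers 1..L. From an interior state the chain moves
    up with probability p/2, down with probability p/2, and stays put with
    probability 1 - p. At s = 1 or s = L the move that would leave the
    range is suppressed, so the chain stays with probability 1 - p/2.
    """
    u = rng.random()
    if u < p / 2:            # attempt a move up
        return min(s + 1, L)
    if u < p:                # attempt a move down
        return max(s - 1, 1)
    return s                 # no transition

# Example: one sample path of 50 steps with p = 0.5 and 19 states.
rng = random.Random(0)
path = [10]
for _ in range(50):
    path.append(step_chain(path[-1], 0.5, 19, rng))
```

Clamping with `min` and `max` folds the suppressed boundary move into the "stay" outcome, which reproduces the 1 - p/2 holding probability at the two end states.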

The processors are modeled by a collection of independent, identically distributed Markov chains.
We let $T_j(n)$ represent the time required by the jth processor to complete the nth step. Assuming N
processors, the time required for the system as a whole to complete the nth step is given by

\[ T_{\max}(n) = \max_{1 \le j \le N} T_j(n). \]

The average processor execution time during the nth step is

\[ \bar{T}(n) = \frac{1}{N} \sum_{j=1}^{N} T_j(n). \]

Then the average processor utilization during the nth step is

\[ \rho(n) = \frac{\bar{T}(n)}{T_{\max}(n)}, \]

and the average period of time that a processor is idle waiting for other processors to finish step n is
consequently given by $T_{\max}(n) - \bar{T}(n)$. Finally, we assume that the states of every chain at step 0 are
identical.
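As a small worked illustration of these per-step quantities, the following Python sketch computes them from a list of processor times. The function name `step_metrics` and the sample numbers are invented for the example.

```python
def step_metrics(times):
    """Return (T_max, T_bar, utilization, idle) for one step.

    `times` holds T_j(n) for j = 1..N, i.e. the execution time of each
    processor during the step. (Hypothetical helper for illustration.)
    """
    n_proc = len(times)
    t_max = max(times)                # the step ends when the slowest processor finishes
    t_bar = sum(times) / n_proc       # average processor execution time
    utilization = t_bar / t_max       # average fraction of the step spent computing
    idle = t_max - t_bar              # average time spent waiting at the synchronization
    return t_max, t_bar, utilization, idle

# Four processors with times 10, 12, 8 and 10: the step takes 12 time
# units, the average processor works for 10 of them and idles for 2.
t_max, t_bar, utilization, idle = step_metrics([10, 12, 8, 10])
```

When all processors carry the same load, the utilization is 1 and the idle time is 0, which is the situation immediately after a perfectly balanced remapping.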


An intuitive feel for the behavior of the MUM model is gained by examining graphs of particular
performance measures. Figure 1a depicts the behavior of the MUM model for varying numbers of
chains. The performance shown is the average (per step) processor utilization as a function of step,
taken over 500 simulations or sample paths, where p = 0.5 and each chain has 19 states. Performance
declines more quickly and to a lower level as one simulates a problem with an increasingly large
number of independent processors. For a given number of chains, the performance decline arises from
the fact that the expected value of $\bar{T}(n)$ remains relatively constant as n increases, while the expected
value of $T_{\max}(n)$ increases in n. Figure 1b depicts the performance of single sample paths of the MUM
model using varying numbers of chains, where as before p = 0.5 and each chain has 19 states. Note
that the decline in performance as a function of step is true only in the sense of comprising a long term
trend; each curve has many local maxima and minima. This point is particularly important, because
any dynamic real time remapping policy mechanism is concerned with the current single sample path
defined by the computation’s execution.

3. Optimal MUM Model Load Balancing 

We now consider the problems associated with remapping a computation described by a MUM 
model. There are two basic issues, how to remap, and when to remap. The tractability of the MUM 
model allows us to describe optimal policies for both of these issues. We treat the remapping mechan- 
ism by demonstrating that under certain regularity conditions, the obvious technique of assigning equal 
loads to each processor is optimal. However, we will note that deviation from these regularity condi- 
tions can cause this technique to be sub-optimal. We treat the remapping decision problem in the 
framework of a Markov decision process. We are then able to symbolically express an optimal remap- 
ping decision policy in terms of a solution to a set of equations. However, the complexity of solving 
these equations grows exponentially in the size of the problem, so that this technique is not useful for 
any but the smallest sized problems. This realization leads us in section 4 to consider a sub-optimal 
but computationally simple decision heuristic. 

Consider the situation where the computation has completed step n, and we wish to redistribute
the computational load. For every processor j, we can determine its load during step n, and can
determine that step n required $T_j(n)$ processing time. We then suppose that it is possible to repartition the
problem domain into N regions with equal (or near equal) computational demand during step n. This
new partition is implemented for step n + 1. We call this mechanism average value remapping. While
the optimality of this policy may seem trivially obvious, further reflection shows that this may not always
be the case. For example, if one processor’s computational load tends to evolve faster than others’, we
may want to assign it less than the average value load upon remapping. Called padding [38], such a
policy accepts slightly decreased immediate performance in order to forestall the active processor from
quickly degrading performance. Nevertheless, under certain conditions (which exclude this difficulty),
we can show that the average value load policy is optimal. The proof of this claim is somewhat
technical, and is provided in Appendix A. To formally state the claim, we define


and 


\[ T_j(n+d, s_j) = T_j(n+d) \quad \text{given that} \quad T_j(n) = s_j. \]

Then

\[ T_{\max}(n+d, s_1, \ldots, s_N) = \max_{1 \le j \le N} \{ T_j(n+d, s_j) \}. \]


THEOREM 3.1: Assume that the Markov chains are homogeneous, and unbounded (no maximum
nor minimum state), and that the transition probabilities are unaffected by remapping. Suppose
$\sum_{j=1}^{N} T_j(n, s_j) = K$. For every $i = 1, 2, \ldots, N$, define

\[ a_i = \begin{cases} \lfloor K/N \rfloor + 1 & \text{if } i \le K \bmod N, \\ \lfloor K/N \rfloor & \text{if } i > K \bmod N. \end{cases} \]

Then for every d > 0 we have

\[ E[T_{\max}(n+d, s_1, \ldots, s_N)] \ge E[T_{\max}(n+d, a_1, \ldots, a_N)]. \]

□


Theorem 3.1 is a strong statement of the average value remapping policy’s optimality. It says that 
employing this policy at step n minimizes the expected execution time of every future step, if no 
further remapping is permitted after step n. 
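The load assignment used by average value remapping is easy to compute. The sketch below assumes integer loads that can be moved freely between processors, as the text supposes; the function name `average_value_remap` is hypothetical.

```python
def average_value_remap(loads):
    """Redistribute integer loads as evenly as possible.

    Returns a list a_1..a_N with the same total K as `loads`: the first
    K mod N entries are floor(K/N) + 1 and the remaining entries are
    floor(K/N). (Illustrative helper; indices are 0-based in the code,
    so the test `i < extra` corresponds to i <= K mod N for 1-based i.)
    """
    n_proc = len(loads)
    base, extra = divmod(sum(loads), n_proc)
    return [base + 1 if i < extra else base for i in range(n_proc)]

# Five processors with a total load of K = 43: floor(43/5) = 8 and
# 43 mod 5 = 3, so three processors receive 9 and two receive 8.
balanced = average_value_remap([7, 12, 9, 10, 5])
```

The remapped loads differ by at most one unit and preserve the total, so the maximum load after remapping is the smallest value any integer assignment of the same total can achieve.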

The second major remapping issue is when to remap. If there were no cost associated with 
remapping, then we would remap frequently to maximize processor utilization. With an increasing cost 
of remapping, we will want to remap less and less often, to better amortize the cost of remapping over
a larger number of computation steps. We can employ a Markov decision process framework [35] to
formally state the remapping decision issue. Within this framework we are able to symbolically 
describe the optimal remapping decision policy. 

Our notation concerning Markov decision processes is taken largely from [35]. Consider a
stochastic process whose state we observe at each of a sequence of times t = 0, 1, ... . Let I be the set
of all possible states. At each time t, the state of the process is discerned to be some s ∈ I. Then a
decision is made, choosing some action a from a finite set A; the choice of action a while in state s
incurs a cost c(s,a). The cost c(s,a) may be random; we assume that E[c(s,a)] is finite for all states s and
actions a. The decision process then passes into another state. The probability $p_{sq}(a)$ of passing into state q
from s is dependent on the action a chosen in state s. The expected total cost of a decision policy is
the expected sum of the costs incurred at each decision step. An optimal decision policy minimizes the
expected total cost.

We restrict our attention to the class of stationary decision policies, those policies which are
deterministic functions of the discerned state. The following useful theorem concerning optimal
stationary decision policies is given by [35].

Let V(s) be the expected total cost of the process which starts in state s, and which is governed by the
optimal stationary policy. Then,

\[ V(s) = \min_{a \in A} \Big\{ c(s,a) + \sum_{q \in I} p_{sq}(a)\, V(q) \Big\}. \qquad (1) \]

□

The function V(s) is known as the optimal cost function. From state s, the optimal stationary decision
is the choice of action which minimizes the right hand side of equation (1). We now formulate the
MUM remapping decision problem in terms of a Markov decision process.

The state of our decision process at step n is the vector of chain states $\langle T_1(n), \ldots, T_N(n), n \rangle$
along with the step number, n. For any decision state $S = \langle s_1, \ldots, s_N, n \rangle$, let J(S) denote the set of
states reachable from $\langle s_1, \ldots, s_N, n \rangle$ in one step, and let A(S) denote the system state achieved by
performing average value remapping on S. Each $T = \langle t_1, \ldots, t_N, n+1 \rangle \in J(\langle s_1, \ldots, s_N, n \rangle)$ has a
transition probability

\[ \mathrm{Prob}\{T \mid S\} = \prod_{i=1}^{N} \mathrm{Prob}\{t_i \mid s_i\} \]

where $\mathrm{Prob}\{t_i \mid s_i\}$ is the probability of chain i passing from state $s_i$ into state $t_i$ in one step. The
execution cost of state S is simply

\[ EC(S) = \max_{1 \le i \le N} \{ s_i \}, \]

and the delay cost of doing a remapping is C. This cost includes both the communication costs and
the computational overhead required for performing a remapping operation. This delay cost C is in
general expected to be a function of the state S, although we will suppress this dependence here for the
sake of simplicity. We can observe the system state S only by allowing the system to execute the step
during which state S is achieved. Thus the decision process state encodes the performance of the last
step’s execution. If we remap from state S, this assumption implies that the next system state will be a
member of J(A(S)). Equation (1) may now be written as


\[ V(S) = \min \begin{cases} EC(S) + C + \sum_{U \in J(A(S))} \mathrm{Prob}\{U\}\, V(U) \\ EC(S) + \sum_{U \in J(S)} \mathrm{Prob}\{U\}\, V(U) \end{cases} \qquad (2) \]

where the top expression on the right-hand side is the cost function associated with remapping, and the
bottom expression is associated with not remapping. According to the theory of Markov decision
processes, the optimal decision to make from state S is the decision which minimizes the right-hand
side of equation (2). If the number of steps taken by the system is some random variable with finite
mean, then the system of equations given by (2) can be solved, at least in theory. In practice, we have
not found any means of substantially reducing the enormous state space addressed by these equations,
and so the exact solution of the optimal remapping decision policy is restricted to relatively small
systems. Furthermore, the optimal Markov decision process as formulated here is applicable only to the
MUM load model. It depends on foreknowledge of the precise probabilistic structure of the MUM
model and on the independence of the MUM model chains. All of these issues make this decision
model an unrealistic candidate as a decision mechanism. In the next section we introduce a heuristic
which avoids these difficulties.
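To make the state-space growth concrete, the following Python sketch solves equation (2) by backward induction for a very small system. It is a simplified illustration, not the procedure used in the paper: it assumes a fixed, known number of steps (`horizon`) rather than a random one, bounded chains with states 1..L, and average value remapping as the remapping mechanism. All function names are hypothetical.

```python
from functools import lru_cache
from itertools import product

def chain_transitions(s, p, L):
    """One-step transition probabilities of a single MUM chain on states 1..L."""
    probs = {}
    for t, pr in ((min(s + 1, L), p / 2), (max(s - 1, 1), p / 2), (s, 1 - p)):
        probs[t] = probs.get(t, 0.0) + pr
    return probs

def average_value_remap(state):
    """The state A(S): loads spread as evenly as possible."""
    base, extra = divmod(sum(state), len(state))
    return tuple(base + 1 if i < extra else base for i in range(len(state)))

def solve(p, L, C, horizon):
    """Return V(state, n), the optimal expected cost-to-go of equation (2).

    Assumption for this sketch: the computation ends after step `horizon`.
    """

    def expected_future(state, n):
        # Sum over U in J(state) of Prob{U | state} * V(U), using the
        # independence of the chains to form the product probability.
        per_chain = [chain_transitions(s, p, L).items() for s in state]
        total = 0.0
        for combo in product(*per_chain):
            prob = 1.0
            for _, pr in combo:
                prob *= pr
            total += prob * V(tuple(t for t, _ in combo), n + 1)
        return total

    @lru_cache(maxsize=None)
    def V(state, n):
        ec = max(state)                       # execution cost EC(S)
        if n == horizon:                      # last step: nothing left to decide
            return ec
        stay = ec + expected_future(state, n)
        remap = ec + C + expected_future(average_value_remap(state), n)
        return min(remap, stay)

    return V

# Two chains whose loads never change (p = 0), starting unbalanced at
# (1, 5) with three steps to run and remapping cost C = 1.
V = solve(p=0.0, L=5, C=1.0, horizon=3)
cost = V((1, 5), 1)
```

The memoized table holds up to L^N states per step, and each state expands into up to 3^N successors, which is the exponential growth that makes the exact policy impractical beyond a handful of chains.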

4. Stop At Rise Decision Policy 

Because of the difficulties inherent in the optimal MUM model remapping decision policy, we 
consider a sub-optimal heuristic called the Stop At Rise (SAR) policy. The SAR policy attempts to 
minimize the average time per step that a processor spends inactive due to synchronization or remap- 
ping delay. In this section we describe SAR and present performance data. 

Any remapping decision policy must attempt to reconcile the costs of remapping against the 
increasing execution costs suffered by not remapping. However, the increasing execution costs are 
future costs, and are consequently uncertain. The optimality of the MUM Markov decision process 
policy stems from its explicit consideration of all possible future activity and costs. A real policy can- 
not afford this computational luxury. Instead, we turn to SAR, a "greedy" policy, which attempts to 
remap so that the average long-term processor idle time since the last remapping is minimized. This 
calculation of idle (or wasted) time includes the time C spent in one remapping operation. Supposing 
that the load was last remapped n steps ago (say, just before step 1), the average processor idle time
per step that we achieve by remapping immediately is denoted W(n), and is given by

\[ W(n) = \frac{\sum_{j=1}^{n} \big( T_{\max}(j) - \bar{T}(j) \big) + C}{n}. \]

W(n) explicitly embodies two of the costs a remapping policy must manage. C is the delay cost
of remapping once, and $\sum_{j=1}^{n} ( T_{\max}(j) - \bar{T}(j) )$ is interpreted as the cost of not remapping. However, under the
greedy philosophy we have adopted, this latter cost is a past cost of not remapping, rather than a future
cost.
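A short rearrangement of the definition, offered here as a reading aid rather than as part of the original derivation, shows how W changes from one step to the next. Write $d(j) = T_{\max}(j) - \bar{T}(j)$ for the idle time at step j; the symbol d is shorthand introduced only for this note. Then for $n \ge 2$,

\[
n\,W(n) = \sum_{j=1}^{n} d(j) + C = (n-1)\,W(n-1) + d(n)
\quad\Longrightarrow\quad
W(n) - W(n-1) = \frac{d(n) - W(n-1)}{n}.
\]

So W rises at step n exactly when the idle time of the step just completed exceeds the running average W(n-1), which already includes the amortized remapping cost C.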

It is instructive to consider the behavior of W(n) as a function of n. Figure 2a plots W(n) for
single sample paths when p = 0.5, C = 8.0, and the chains have 19 states. Paths from 2, 8, and 32 chain
systems are shown. Figure 2b plots E[W(n)] under these same parameter values. Both graphs show
W(n)’s marked propensity to drop, be minimized over some section of its domain, and then begin and
continue to rise. This behavior can be explained in terms of W(n)’s definition. There is a tendency for
W(n) to decrease in n, as increasing n amortizes the cost C over a larger number of computation steps.
But there is also a tendency for W(n) to increase with increasing n, as we may expect
$\sum_{j=1}^{n} \bar{T}(j)/n$ to remain relatively constant, and $\sum_{j=1}^{n} T_{\max}(j)/n$ to increase with n.
The tendency to decrease dominates initially, but has less effect on W(n) as n grows. The tendency
to increase then becomes predominant. We once again note that this tendency is an expected tendency.
The precise behavior of W(n) for a given sample path will vary with that sample path. Whereas it is
reasonable to expect that E[W(n)] has a single local minimum (a topic we discuss in section 5), the
values of W(n) for a given sample path may exhibit multiple local minima. The significant implication
of these observations is that we cannot choose a remapping step n with any assurance that it
minimizes W(n). We can however remap once the first local minimum of W(n) is detected. We thus
choose to remap at the first step n after the last remapping such that W(n) > W(n-1). This policy of
remapping when a local minimum of W(n) is first detected is labeled the SAR (Stop At Rise) policy.
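The SAR rule needs only a running sum and the previous value of W, as the following Python sketch suggests. The class name `StopAtRise` and its interface are assumptions made for illustration; the paper does not prescribe an implementation.

```python
class StopAtRise:
    """Track W(n) since the last remapping and signal when it first rises."""

    def __init__(self, remap_cost):
        self.remap_cost = remap_cost   # C, the delay cost of one remapping
        self.reset()

    def reset(self):
        """Call after a remapping: no steps observed yet."""
        self.steps = 0                 # n, steps since the last remapping
        self.idle_sum = 0.0            # running sum of T_max(j) - T_bar(j)
        self.prev_w = None             # W(n-1), undefined before the first step

    def observe(self, t_max, t_bar):
        """Record one completed step; return True if SAR says to remap now."""
        self.steps += 1
        self.idle_sum += t_max - t_bar
        w = (self.idle_sum + self.remap_cost) / self.steps
        rise = self.prev_w is not None and w > self.prev_w
        self.prev_w = w
        return rise

# With C = 8 and idle times 0, 1, 2, 3, 4 on successive steps, W takes the
# values 8, 4.5, 3.67, 3.5, 3.6, so the first rise is detected at step 5.
sar = StopAtRise(remap_cost=8.0)
decisions = [sar.observe(10 + k, 10) for k in range(5)]
```

A caller would perform the remapping and then invoke `reset()` whenever `observe` returns True, so that W is always measured from the most recent remapping.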

We studied the performance of the SAR policy by comparing it to three other policies: the
optimal policy, the "remap every m steps" policy, and the policy which never remaps. It is possible to
compute the expected time required to complete small sized problems when the optimal Markov
decision policy is utilized to decide when to remap. Figure 3 compares a performance metric for the
Markov decision policy, SAR, and a non-remapped system for three chains, 100 steps, and various
remapping costs. The SAR data is the average of 500 simulation runs for each value of C. As the
remapping cost increases, the discrepancy between the performance obtained through the optimal decision
policy and SAR increases. With increasing remapping cost, both the performance of the optimal
decision policy and of SAR approach the performance obtained when no remapping is performed.

The performance metric used in figure 3 for all policies depicted in that figure is an estimate of
processor utilization: the ratio of the expected $\sum_{j=1}^{n} \bar{T}(j)$ to the expected total time spent by the system to
solve the problem, including the cost of all remappings. This measure is useful in figure 3 as it is
straightforward in the case of the optimal policy to calculate the expected time required to complete a
problem as well as the expected $\sum_{j=1}^{n} \bar{T}(j)$. For all subsequent figures, performance data is obtained by
simulation, and the easily computed average performance over all simulations is utilized, i.e. the mean
of the ratio of $\sum_{j=1}^{n} \bar{T}(j)$ to the total time spent by the system to solve the problem, including the cost
of all remappings. Both performance measures were computed for all simulations, and found to differ
from each other by less than one percent.
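The two performance measures differ only in where the averaging happens, which the following Python sketch makes explicit. Each run is summarized as a pair (useful time, total time), where the useful time is the sum of the average processor times and the total time includes all remapping costs. The function names and the sample numbers are invented for the example.

```python
def ratio_of_means(runs):
    """Expected useful time divided by expected total time (the figure 3 measure)."""
    mean_useful = sum(useful for useful, _ in runs) / len(runs)
    mean_total = sum(total for _, total in runs) / len(runs)
    return mean_useful / mean_total

def mean_of_ratios(runs):
    """Average over runs of useful time divided by total time (later figures)."""
    return sum(useful / total for useful, total in runs) / len(runs)

# Two made-up runs with very different total times: the measures disagree
# noticeably (about 0.667 versus 0.625).
runs = [(50.0, 100.0), (150.0, 200.0)]
a = ratio_of_means(runs)
b = mean_of_ratios(runs)
```

The two measures coincide when every run has the same total time, so close agreement between them is what one would expect when total times vary little from run to run.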

One simple but intuitive remapping policy is the "remap every m steps" policy, or fixed interval
policy. This policy is insensitive to statistical variations in a system’s performance, and requires
pre-run-time analysis to determine an effective value of m. However, we might well choose to employ a
fixed interval policy if it is costly to measure system performance at every step. In this case, we would
attempt to choose m to optimize the system’s expected performance. In figure 4, we compare the
performance obtained through the use of: (1) SAR, (2) the fixed interval policy for a wide range of
values of m, and (3) not remapping at all. The performance obtained in a system using the MUM
model with eight independent processors is depicted. Each problem consists of 400 steps, each data
point is obtained through 200 simulations, and remapping costs of 2 and 8 are assumed. For the SAR
policy, we plotted performance against the calculated average number of steps between consecutive
remappings. In the fixed interval policy, we plotted performance against m, the fixed number of steps
between remappings. The number of steps between remappings has no meaning when no remapping is
done; the performance obtained when no remapping occurs is plotted as a straight horizontal line to
facilitate comparison with the other results. The calculation of the performance obtained through the
use of the optimal Markov decision policy is not practical in this case due to the long run times and
large memory requirements that would be required.

It is notable that SAR’s performance was comparable to, and in fact slightly higher than, that
obtained by remapping at the optimal fixed interval. The average number of elapsed steps between
SAR remappings corresponds closely to the optimal fixed interval remapping policy. Similar results 
were obtained in other cases using the MUM load model. These results are encouraging for two rea- 
sons. Since SAR adapts to statistical variations in the system’s behavior, we would hope that it can 
outperform a non-adaptive policy. Our data shows that SAR outperforms the optimal fixed interval 
policy. Secondly, SAR appears to find the "natural frequency" of remapping for a given remapping 
cost. While the exact number of steps between remappings may vary with the system’s sample path, 
the average number of steps between remappings is close to that of the optimal fixed interval policy. 
Note also that the performance obtained by SAR is markedly superior to the performance obtained 
when no remapping is performed. From extensive simulation results not presented here, we found that 
the difference between the performance obtained by SAR and the performance obtained when no 
remapping is performed increases with the number of chains. This is consistent with the observed 
results in figures 3 and 4. 



In the face of uncertainty about future problem behavior, it is reasonable to design a remapping
decision policy which optimizes performance locally in time. The SAR policy does this by attempting 
to minimize W(n), a statistic which measures performance since the last remapping. Performance 
experiments show that the SAR policy effectively finds the "natural frequency" of remapping as a 
function of the rate at which resource demand changes, and the cost of remapping. As such, SAR is a 
promising policy for real remapping situations. In section 6 we demonstrate that the SAR policy can 
also be effectively employed with computational models other than the MUM model. 
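
The decision rule summarized above can be sketched in a few lines. This is only an illustration under stated assumptions: W(n) is taken to be the cumulative gap between maximum and mean processor time since the last remapping, plus the remapping cost C, divided by n (the form analyzed in section 5), and the rule is assumed to remap at the first step where W(n) rises above its previous value. The function name and the input trace are hypothetical; in a real run the measured times after a remapping would reflect the rebalanced load.

```python
def sar_remap_steps(t_max, t_mean, cost):
    """Return the 1-based steps at which a stop-at-first-rise rule would remap.

    t_max[i], t_mean[i]: maximum and mean processor time observed at step i
    (a hypothetical measured trace). cost: remapping cost C in the same units.
    """
    remaps = []
    wasted = 0.0    # cumulative (max - mean) since the last remapping
    n = 0           # steps since the last remapping
    prev_w = None   # W(n-1); None right after a remapping
    for step, (hi, avg) in enumerate(zip(t_max, t_mean), start=1):
        n += 1
        wasted += hi - avg
        w = (wasted + cost) / n
        if prev_w is not None and w > prev_w:
            # W(n) has started to rise: assume its minimum was just passed.
            remaps.append(step)
            wasted, n, prev_w = 0.0, 0, None
        else:
            prev_w = w
    return remaps
```

For a trace whose gap grows by one unit per step (gaps 1, 2, ..., 8) and C = 3, this sketch remaps at steps 4 and 8. Only the mean time, the maximum time, and C enter the rule, which is why it does not depend on the details of the load model.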

5. Analysis of E[W(n)] 

In this section we analyze the behavior of $E[W(n)]$. First we show that if the difference between
expected maximum load and average load is an increasing function, then $E[W(n)]$ has at most one local
minimum as a function of n. We then show that if $E[W(n)]$ has a minimum at $\hat{n}$, then the
fixed-interval decision policy of remapping every $\hat{n}$ steps minimizes the expected loss per step
of any fixed-interval policy, including the policy which prohibits remapping. These results are
independent of the MUM model; we then present conditions on MUM parameters for which these results
apply, and show that MUM assumptions allow us to compute $E[W(n)]$ exactly for all n. Finally, we
analyze the effects that problem behavior and remapping costs have on optimal (fixed-interval)
remapping frequency. This analysis gives us both a qualitative and a quantitative grasp of the
relationships between the different factors involved in a decision to remap. Furthermore, since our
SAR data indicates that SAR adaptively finds the optimal fixed-interval remapping frequency, we expect
these relationships to hold true for SAR.

We will first show that under a reasonable hypothesis, the expected time wasted per step,
$E[W(n)]$, can have no more than one local minimum. Moreover, the step $\hat{n}$ at which the minimum
of $E[W(n)]$ occurs is a monotone increasing function of the cost of remapping C. In order to minimize
wasted time per step, if remapping becomes more costly, the number of steps between successive
remappings must increase. After remapping we assume that the expected partitioning of load between
processors becomes increasingly uneven. The hypothesis used in the theorem formalizes this and
supposes that the expected difference between the maximum order statistic and the average is monotone
increasing with n.

THEOREM 5.1 : Suppose $E_i = E[T_{max}(i)] - E[T(i)]$ is monotone increasing in i. Then $E[W(n)]$
has at most one minimum and if this minimum exists, it is a monotone increasing function of the
remapping cost C.

PROOF: Let $\delta(n) = E[W(n+1)] - E[W(n)]$, and let

$$\kappa(n) = n E_{n+1} - \sum_{i=1}^{n} E_i.$$

From the definitions of $\delta(n)$, $E[W(n)]$, $E_i$ and $\kappa(n)$, it is straightforward to show
that $\delta(n) = (\kappa(n) - C)/(n(n+1))$, so that

$$\delta(n) < 0 \text{ iff } \kappa(n) < C$$

and that

$$\delta(n) > 0 \text{ iff } \kappa(n) > C.$$

Using the assumption that $E_i$ is monotone increasing, we first show that $\kappa(n)$ is monotone
increasing. This follows, since

$$\kappa(n+1) - \kappa(n) = \left[ (n+1) E_{n+2} - \sum_{i=1}^{n+1} E_i \right] - \left[ n E_{n+1} - \sum_{i=1}^{n} E_i \right] = (n+1)(E_{n+2} - E_{n+1}) \geq 0.$$

For the sake of contradiction, assume that there are local minima at $n_1$ and at $n_2$; without loss
of generality we suppose that $n_2 > n_1$. Since $n_1$ is a local minimum, $E[W(n_1+1)] > E[W(n_1)]$,
so that $\delta(n_1) > 0$ and hence $\kappa(n_1) > C$. If $n_2$ is also a local minimum, then
$E[W(n_2)] < E[W(n_2 - 1)]$, so that $\delta(n_2 - 1) < 0$ and hence $\kappa(n_2 - 1) < C$. If
$n_1 = n_2 - 1$ we have a direct contradiction, since $\kappa$ cannot be both greater than and less
than C at the same point. If $n_1 < n_2 - 1$, we have a contradiction as $\kappa(n)$ is a monotone
increasing function of n.

Now, if the minimum $\hat{n}$ exists, then $\delta(\hat{n} - 1) < 0$, so that $\hat{n}$ is one more
than the largest n for which $\kappa(n) < C$. Since $\kappa(n)$ is a monotone increasing function of
n, it follows immediately that the $\hat{n}$ minimizing $E[W(n)]$ is a monotone increasing function
of C.

□ 
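
The argument in this proof can be checked numerically. The sketch below assumes $E[W(n)] = (\sum_{i \le n} E_i + C)/n$ and uses a hypothetical increasing sequence $E_i = 0.5i$ with C = 10; it is an illustration, not data from our experiments. It computes $\kappa(n)$, confirms that it is monotone, and locates the minimizing n both directly and through the crossing of $\kappa(n)$ and C.

```python
def expected_w(gaps, cost):
    """E[W(n)] for n = 1..len(gaps), where gaps[i-1] = E_i and cost = C."""
    out, total = [], 0.0
    for n, e in enumerate(gaps, start=1):
        total += e
        out.append((total + cost) / n)
    return out

def kappa(gaps):
    """kappa(n) = n * E_{n+1} - sum_{i<=n} E_i for n = 1..len(gaps)-1."""
    out, total = [], 0.0
    for n in range(1, len(gaps)):
        total += gaps[n - 1]               # running sum E_1 + ... + E_n
        out.append(n * gaps[n] - total)    # gaps[n] is E_{n+1}
    return out

gaps = [0.5 * i for i in range(1, 41)]     # hypothetical increasing E_i
cost = 10.0
w = expected_w(gaps, cost)
k = kappa(gaps)

# Direct minimizer of E[W(n)] versus the last n at which kappa(n) < C.
n_hat = min(range(1, len(w) + 1), key=lambda n: w[n - 1])
largest_below = max(n for n, v in enumerate(k, start=1) if v < cost)
```

With these inputs `n_hat` is 6 and `largest_below` is 5, in line with the relation between the minimizer and the point where $\kappa(n)$ crosses C.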


One may expect that following a remapping, the difference in the time per step required by the
maximally loaded processor and the average processor will tend to increase in time. Theorem 5.1
states that as long as $E[T_{max}(i)] - E[T(i)]$ is a monotone increasing function of i, then
$E[W(n)]$ has at most one minimum. Under ideal circumstances, an existing minimum defines the optimal
fixed-interval remapping policy, a fact demonstrated by Theorem 5.2.

THEOREM 5.2 : Suppose that remapping after n steps resets the $E_i$ sequence, so that the expected
loss between remappings is $\sum_{i=1}^{n} E_i + C$. If $E[W(n)]$ is minimized at $\hat{n}$, then the
optimal fixed-interval remapping policy (including the policy which never balances) is the policy
which remaps every $\hat{n}$ steps.

PROOF: The average loss per step at the kth remapping is

$$\frac{k \left( \sum_{i=1}^{n} E_i + C \right)}{kn} = E[W(n)].$$

Thus as $k \to \infty$, the limiting average loss is simply $E[W(n)]$, which is minimized when
$n = \hat{n}$. To see that $E[W(\hat{n})]$ is less than the limiting average loss per step of never
remapping, we observe that

$$E[w(n)] = \frac{\sum_{i=1}^{n} E_i}{n}$$

is the average loss per step without remapping, and that

$$E[W(n)] = E[w(n)] + \frac{C}{n}.$$

Since $E[w(n)]$ is increasing and the difference between $E[W(n)]$ and $E[w(n)]$ gets arbitrarily
small with increasing n, the fact that $E[W(n)]$ is increasing for $n > \hat{n}$ ensures that

$$E[W(\hat{n})] < \lim_{n \to \infty} E[w(n)].$$

□

While recognizing that the resetting assumption upon which Theorem 5.2 rests may not be met in
practice, the theorem’s statement supports the idea of SAR: seeking to remap when W(n) is minimized.

It is reasonable to investigate the conditions under which the simulated loads produced by the
MUM model lead to a sequence of $E_i$ that is monotone increasing. We assume that all chains begin at
the "middle" state. We will show that as long as it is relatively unlikely that a processor will
require the maximum possible time L to complete a step, we are guaranteed a monotone increasing
sequence of $E_i$. Theorem 5.3 below demonstrates that a sufficient condition for this situation to
prevail is that the transition probability p be less than or equal to 2/3.

We state two lemmas, which are used to prove Theorem 5.3. An essential tool used in the statement
of one of these lemmas is the theory of stochastic variability (see [36]). A random variable X is said
to be stochastically more variable than random variable Y, denoted $X \geq_v Y$, if
$E[g(X)] \geq E[g(Y)]$ for every increasing convex function g. $X \geq_v Y$ intuitively means that X
places more probability on occurrences of high sample values than does Y. For our purposes, a result
given in [36] is very important: if $X_1, \ldots, X_k$ is a group of independent random variables,
$Y_1, \ldots, Y_k$ is a group of independent random variables, and $X_i \geq_v Y_i$ for
$i = 1, 2, \ldots, k$, then

$$E[\max(X_1, \ldots, X_k)] \geq E[\max(Y_1, \ldots, Y_k)]. \qquad (3)$$

Also, for every j = 1,2,..., L, and step n, we denote the probability mass function

$$p_j(s;n) = \mathrm{Prob}\{T_j(n) = s\}.$$

The proofs of the following two lemmas are detailed, and have been relegated to Appendix B.

LEMMA 5.1 : Assume that at step 0, all chains are in state $(L+1)/2$, i.e., for all j,
$p_j((L+1)/2; 0) = 1$; also assume that $p \leq 2/3$. Then for all steps n, we have
$p_j(k-1;n) \leq p_j(k;n)$ for $k \leq (L+1)/2$, and $p_j(k+1;n) \leq p_j(k;n)$ for $k \geq (L+1)/2$.

□ 


Lemma 5.1 simply identifies conditions which ensure that the probability mass function for a chain’s
state distribution is unimodal at every step. The next lemma shows that if the probability of chain j
being in state L is always less than or equal to the probability of its being in any other given
state, then $T_j(n)$ is stochastically more variable than $T_j(n-1)$.

LEMMA 5.2 : If for every n,

$$\mathrm{Prob}\{T_j(n) = L\} = \min_k \mathrm{Prob}\{T_j(n) = k\}$$

then $T_j(n) \geq_v T_j(n-1)$ for all n.

□ 


Note that if the conditions of Lemma 5.1 are satisfied, then the conditions of Lemma 5.2 are also 
satisfied. In this case, we also have 

THEOREM 5.3 : If $\mathrm{Prob}\{T_j(0) = (L+1)/2\} = 1$, and $p \leq 2/3$, then $E[W(n)]$ has at
most one local minimum. If the minimum exists, it is a monotone increasing function of the remapping
cost C.

PROOF: The condition that $\mathrm{Prob}\{T_j(n) = L\} = \min_k \{\mathrm{Prob}\{T_j(n) = k\}\}$
follows immediately from Lemma 5.1, hence by Lemma 5.2 we have the conclusion that
$T_j(n) \geq_v T_j(n-1)$ for all n, and all j. Since the chains are independent, it follows from (3)
that

$$E[T_{max}(n)] \geq E[T_{max}(n-1)].$$

The average time spent by the processors performing computations during step i is

$$T(i) = \frac{1}{N} \sum_{j=1}^{N} T_j(i),$$

so that the expected value of $T(i)$ is

$$E[T(i)] = \frac{1}{N} \sum_{j=1}^{N} E[T_j(i)].$$

Because we assumed that all chains j are initially in the middle state $(L+1)/2$, and the chains are
symmetric, it follows that $E[T_j(i)] = (L+1)/2$ for all i. But since $E[T_{max}(i)]$ increases in i
and $E[T(i)]$ is constant, we see that $E_i$ is monotone increasing in i. Our conclusion then follows
from Theorem 5.1.

□ 

It is interesting to note that the result above does not depend on the MUM chains being identically dis- 
tributed; Theorem 5.3 applies if every chain’s transition probability is less than 2/3. 

The simplicity of the MUM model allows us to derive an exact expression for $E[W(n)]$. In
theory, we could use this expression to find the n which minimizes $E[W(n)]$. However, this expression
is computationally cumbersome, and does not immediately lend itself to useful interpretation. We
therefore also present two more tractable approximations to $E[W(n)]$, and comment on the
relationships they show between MUM model parameters.

It is straightforward to compute $E[W(n)]$ for the MUM model. For each processor j, we define
the probability state vector describing the distribution of $T_j(m)$ as
$p_j(m) = (p_j(1;m), \ldots, p_j(L;m))$. We also define M as the matrix of the Markov chain’s one step
transition probabilities. The probability state vector for the mth step is given by the vector-matrix
product

$$p_j(m) = p_j(0) M^m.$$

Note that this expression depends on the distribution of chain j’s initial state. The cumulative
distribution function for $T_j(m)$ may be written as

$$P_j(s;m) = \mathrm{Prob}\{T_j(m) \leq s\} = \sum_{i=1}^{s} p_j(i;m).$$

The time required by an N processor system to complete step m is given by the distribution of the
maximum order statistic of the processors’ states at step m. For every state s and step m, the
probability that the maximum state exceeds s is equal to one minus the probability that all processors
have a state no greater than s at step m. The complementary cumulative distribution function for the
maximum order statistic at step m is thus

$$\mathrm{Prob}\{\max_j \{T_j(m)\} > s\} = 1 - \mathrm{Prob}\{T_j(m) \leq s,\ j = 1,2,\ldots,N\} = 1 - \prod_{j=1}^{N} P_j(s;m).$$


It is well known that for any non-negative discrete random variable X,
$E[X] = \sum_{a \geq 0} \mathrm{Prob}\{X > a\}$. The mean time for the system to compute step m is
thus expressed as

$$E[T_{max}(m)] = \sum_{s=1}^{L} \left[ 1 - \prod_{j=1}^{N} P_j(s-1;m) \right].$$

As mentioned previously in this section, the expected value of $T(m)$ is

$$E[T(m)] = \frac{1}{N} \sum_{j=1}^{N} E[T_j(m)].$$

Furthermore, $E[T_j(m)]$ may be written as

$$E[T_j(m)] = \sum_{s=1}^{L} \left[ 1 - P_j(s-1;m) \right].$$

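We can now derive an expression for $E[W(n)]$. By definition,

$$E[W(n)] = \frac{1}{n} \left( \sum_{m=1}^{n} \left[ E[T_{max}(m)] - E[T(m)] \right] + C \right). \qquad (4)$$

Substituting the expressions derived above into (4), we obtain

$$E[W(n)] = \frac{1}{n} \left( \sum_{m=1}^{n} \sum_{s=1}^{L} \left[ 1 - \prod_{j=1}^{N} P_j(s-1;m) - \frac{1}{N} \sum_{j=1}^{N} \left( 1 - P_j(s-1;m) \right) \right] + C \right). \qquad (5)$$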
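
Expression (5) is easy to evaluate numerically. The following is a minimal sketch rather than the program used for our experiments. It assumes identical chains that all start in the middle state, moves of one state down or up with probability p/2 each, and a blocked move at state 1 or L leaving the chain where it is (the boundary behavior used in the proof of Lemma 5.1 in Appendix B). The function names are hypothetical.

```python
def step_pmf(pmf, p):
    """Advance the state distribution one step: p_j(m) = p_j(m-1) M."""
    L = len(pmf)
    new = [0.0] * L
    for s, mass in enumerate(pmf):
        down = s - 1 if s > 0 else s        # blocked moves keep the state
        up = s + 1 if s < L - 1 else s
        new[down] += mass * p / 2
        new[s] += mass * (1 - p)
        new[up] += mass * p / 2
    return new

def mum_expected_w(L, N, p, cost, n_max):
    """Return [E[W(1)], ..., E[W(n_max)]] for N identical chains with L states."""
    pmf = [0.0] * L                         # pmf[s-1] = Prob{T_j(m) = s}
    pmf[(L + 1) // 2 - 1] = 1.0             # every chain starts in the middle state
    total, out = 0.0, []
    for n in range(1, n_max + 1):
        pmf = step_pmf(pmf, p)
        cdf, acc = [0.0], 0.0               # cdf[s] = P_j(s; m), with P_j(0; m) = 0
        for mass in pmf:
            acc += mass
            cdf.append(acc)
        # Identical chains: the product over j collapses to a power of N.
        e_max = sum(1 - cdf[s - 1] ** N for s in range(1, L + 1))
        e_avg = sum(1 - cdf[s - 1] for s in range(1, L + 1))
        total += e_max - e_avg
        out.append((total + cost) / n)
    return out
```

For a three-state chain with N = 2, p = 0.5 and C = 1, the first value is 1.375: after one step the states have probabilities 1/4, 1/2, 1/4, so the expected maximum is 2.375 against a mean of 2. Scanning the returned list for its smallest entry gives the minimizing n for a given set of parameters.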


Expression (5) may be used to calculate the expected value of the time lost per processor
as a function of the number of steps since the last remapping. In principle, we could then compute the
$n = \hat{n}$ which minimizes expression (5). However, this precise formulation does not give us any
insight into the qualitative effects that the MUM model parameters have on $\hat{n}$. To attempt to
gain this insight, we considered two approximations for $E[W(n)]$ which better express relationships
between the MUM model parameters. Since SAR appears to adaptively find the optimal fixed-interval
remapping frequency, we expect that these relationships affect SAR remapping frequency in the same
way. The first approximation we describe is asymptotic in the number of processors; the second
approximation uses an upper bound on the max order statistic of a symmetric random variable.

Our first approximation assumes that the number of states in a processor’s chain is odd, and
denotes the "middle" state by $K = (L+1)/2$. The following lemma shows how $\hat{n}$ depends on the
MUM model parameters L and C as the number of processors gets large.


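LEMMA 5.3 : As $N \to \infty$,

$$\hat{n} = \begin{cases} \sqrt{2C} & \text{if } C \text{ is below a critical threshold depending on } L - K \\ \text{no minimum} & \text{otherwise.} \end{cases}$$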



□ 


The proof of Lemma 5.3 is given in Appendix B. 

Lemma 5.3 shows that for small values of C, $\hat{n}$ increases in C as a square root, until C
reaches a threshold. For values of C larger than this threshold, $E[W(n)]$ cannot be minimized, which
implies that the cost of remapping is too high to ever consider remapping. We recognize this critical
threshold as essentially the expected processor state squared, divided by two. Lemma 5.3 thus helps to
quantify the role that the remapping cost plays in the MUM model. It identifies a relationship between
permissible remapping costs and processor execution time, showing that remapping improves performance
even if the remapping cost is relatively large compared to processor execution time.

The asymptotic assumptions underlying Lemma 5.3 lead to one discomfiting effect: the variation
in the Markov chains does not play a role in determining $\hat{n}$. A second approximation to
$E[W(n)]$ is more sensitive to this variation. Lemma 5.4 states the general dependence of $\hat{n}$ on
the number of processors N, the cost C of remapping, and the MUM transition parameter p.


LEMMA 5.4 : Using an approximation given by [14], $\hat{n}$ is a function of

$$\frac{C}{N d(N) \sqrt{p}},$$

where $d(N) = N^{-1/2}$.


□ 


The proof of Lemma 5.4 is also given in Appendix B. 

We have noted that as the cost of remapping increases, it makes sense to remap less frequently so
as to better amortize the cost of remapping over a larger number of computational steps. For the
MUM model, $\hat{n}$ is actually a function of the expression $\frac{C}{N d(N) \sqrt{p}}$, where
$d(N) = N^{-1/2}$. The cost C of remapping, the number of processors N and the activity p of the
processors together determine the value of $\hat{n}$. One can see that increasing the number of
processors, and increasing the activity associated with each processor leads to a reduction in the
number of steps between remappings analogous to that obtained by decreasing the cost of remapping.

This section has examined the statistic W(n). We have demonstrated general conditions which
ensure that $E[W(n)]$ has at most one minimum, and that the existence of the minimum at $\hat{n}$
implies that remapping every $\hat{n}$ steps is the optimal fixed-interval remapping policy. This
result supports our use of the SAR policy by suggesting that performance gains are achieved by
minimizing W(n). We then looked at $E[W(n)]$ specifically under the MUM model assumptions. We showed
general conditions ensuring that $E[W(n)]$ has at most one minimum; we showed that $E[W(n)]$ is
computable, and analyzed the interrelationships between MUM model parameters by looking at how they
affect the expected frequency of remapping.

6. Load Dependency Model

While the MUM model is analytically tractable, some of its assumptions may not be realized in 
practice. For example, MUM assumes that a processor’s load drift is stochastically independent of any 
other processor’s load drift. It is easy to construct examples where this assumption is violated. This 
flaw could be corrected by allowing correlation between chains’ transitions, but then an appropriate 
model of correlation would have to be determined. MUM also assumes homogeneous Markov chains; 
there is no problem in allowing heterogeneous chains, but the analysis we have developed does not 
apply to such a model. More seriously, the MUM model implicitly assumes that the transitional 
behavior of a processor’s computational load is determined by the processor, rather than the load. This 
assumption is embodied in the assumption of transitional invariance under remapping, used to prove 
Theorem 3.1. This flaw is corrected in a model where the distinction between a processor and its load 
is clearly drawn. We call this the Load Dependency (LD) model. 

The LD model directly simulates the spatial distribution of computational load in a domain. We 
consider a two dimensional plane in which activity occurs, for example, a factory floor. To simulate 
this activity we impose a dense regular grid upon the plane; each square of the grid defines an activity 
point. We suppose that activity in the plane is discretized in simulation time, and model the behavior 
of activity as follows. Each time step a certain amount of activity may occur at an activity point. This 
activity is simulated (for example, arrival of parts to a manufacturing assembly station), causing a cer- 
tain amount of computation. By the next time step some of that activity may have moved to neighbor- 
ing activity points. This movement of activity simulates the movement of physical objects in a physi- 
cal domain, and is modeled by the movement of work units. A work unit is always positioned at some 
activity point, and has a weight describing its computational demand at that activity point. From one 
time step to the next, a work unit may move from an activity point to a neighboring activity point; this 
movement is governed probabilistically. In the LD model, the probability that a work unit will move
from one activity point to another, as the problem goes from one time step to the next, is called the
transition probability linking the two activity points.

We employ binary dissection [6] to partition the activity points into N activity regions, where the 
points in an activity region form a rectangular mass. The weight of an activity point is taken to be the 
sum of the weights of work units at the point, at the time that the partitioning is performed. The com- 
putational load on a processor during a time step is found by adding the weights of all work units 
resident on activity points assigned to that processor. 

In a wide variety of problems, including those mentioned in section 1 as examples of varying 
demand distribution problems, data dependencies are quite local. Decomposition of a domain into con- 
tiguous regions with a relatively small perimeter to area ratio is thus generally desirable for reducing 
the quantity of information that must be exchanged between partitions. Furthermore, due to the local 
nature of the data dependencies, the communication required between partitions in a binary dissection 
will generally be greatest in partitions that are in physical proximity. The analysis in [6] shows that 
this type of partition is effective for static remapping, and is easily mapped onto various types of 
parallel architectures. Estimates are also obtained of the communication costs incurred when binary 
dissection is used to partition a problem’s domain, and the resulting partitions are mapped onto a given 
architecture. The communication cost estimates obtained by such analysis are inevitably problem, 
mapping and architecture dependent. This binary dissection is briefly described in Appendix C. 

A processor’s load changes from one time step to the next when a work unit either moves out to an
activity point assigned to another processor, or moves in from an activity point assigned to a
different processor. This explicit modeling of work unit movement removes the most serious flaw in the
MUM model. Unlike the MUM model, the change in a processor’s computational load from one time
step to the next is explicitly dependent on its own load, and on the loads of processors with
neighboring activity regions.

To ensure the correctness of the simulation, we require that all computation associated with a
time step be completed before the simulation advances to the next time step. Thus, as in the MUM
model, the time required to complete a time step is the maximum computation time among all
processors. Again like the MUM model, as time progresses any initial balance will disappear, and
average processor utilization will drop. This is particularly true if the work unit movement
probabilities are anisotropic.
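
One time step of the LD model can be sketched as follows. This is an illustration under simplifying assumptions, not the simulator used for figure 5: every work unit has weight one, a move that would leave the mesh is treated as staying in place, and the partition is supplied as a hypothetical `owner` function (vertical strips in the usage below) rather than computed by binary dissection.

```python
import random

def move_work(grid, probs, rng):
    """Move each unit-weight work unit once; grid[r][c] counts units at point (r, c).

    probs maps (dr, dc) offsets to transition probabilities; any leftover
    probability mass means the work unit stays where it is.
    """
    rows, cols = len(grid), len(grid[0])
    new = [[0] * cols for _ in range(rows)]
    offsets = list(probs) + [(0, 0)]
    weights = [probs[o] for o in probs]
    weights.append(max(0.0, 1.0 - sum(weights)))
    for r in range(rows):
        for c in range(cols):
            for _ in range(grid[r][c]):
                dr, dc = rng.choices(offsets, weights)[0]
                nr, nc = r + dr, c + dc
                if not (0 <= nr < rows and 0 <= nc < cols):
                    nr, nc = r, c           # assumption: blocked moves stay put
                new[nr][nc] += 1
    return new

def processor_loads(grid, owner, n_procs):
    """Sum the work on the activity points assigned to each processor."""
    loads = [0] * n_procs
    for r, row in enumerate(grid):
        for c, units in enumerate(row):
            loads[owner(r, c)] += units
    return loads

def utilization(loads):
    """Mean load divided by maximum load; the step time is max(loads)."""
    return sum(loads) / (len(loads) * max(loads))
```

As a usage example, a 2 by 2 mesh with one work unit per point, a transition probability of 1 for moving one column to the right, and one processor per column ends the step with loads 0 and 4, so utilization drops to 0.5. Feeding `max(loads)` and the mean load from each step into the W(n) statistic is all that SAR needs from this model.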

The SAR policy can also be used with the LD model, since the W(n) statistic requires only the
mean processor execution time, the maximum processor execution time per time step, and the remapping
cost C. The performance of SAR on the LD model was examined by once again comparing SAR
to the performance of fixed interval policies. Figure 5 plots expected processor utilization as a
function of time for remapping costs of 50 and 100 work units, when a 64 by 64 mesh of activity points
is initialized with one work unit per activity point, and 16 processors are employed. The transition
probabilities are anisotropic (given in the figure legend), so that the work tends to drift to the
upper right portion of the mesh over time. Not taken into account here is the cost of the
interprocessor communication that occurs at the end of each step when partitions exchange newly
computed results. As was observed in the MUM model, the performance of the SAR rule and the average
number of elapsed steps between SAR remappings corresponded closely to those of the fixed interval
leading to the optimal performance. In figure 5, the performance of SAR for a given cost is superior
to that obtained from fixed load balancing at the optimal frequency. In other simulations, the
performance obtained from SAR was comparable to, but slightly below, that obtained from the optimal
fixed load balancing method. Note that the performance obtained by SAR in figure 5 is markedly greater
than that obtained when no remapping is performed.

7. Summary

To date, most load balancing problems addressed in the literature concern systems which can be 
modeled by a queueing network. A large class of parallel computations are not well modeled by 
queues and job arrivals, particularly those solving scientific problems. Yet the time-variant behavior of 
these computations, coupled with their synchronization needs, creates the following performance prob- 
lem. Good processor utilization requires that the computational load be balanced between processors, 
yet a good balance can’t be sustained because of the variation in the computational workload. To treat 
this problem, we need to both model the phenomenon of performance degradation, and develop remap- 
ping decision policies which effectively determine when the computational load should be remapped 
onto the parallel machine. This paper has addressed both issues. We describe two different models
of load evolution. One model is simpler than the other, and can be analyzed. The other model better 
captures a means by which a processor’s load changes in time. We have developed and studied an 
adaptive remapping decision policy SAR which proves to be effective on both models. SAR does not 
depend on the details of the model structure; rather, it attempts to minimize a statistic which measures 
the long-term average system degradation (including that due to remapping) as a function of time. We 
have also analytically demonstrated conditions ensuring that the statistic’s mean has a single local 
minimum, and that the optimal fixed-interval remapping policy is to remap when the statistic’s mean is 
minimized. These analytic results validate SAR’s approach. Because of its appealing empirical and 
analytical properties, SAR is a promising candidate for use in an actual parallel system.

Appendix A

In this appendix we prove Theorem 3.1. The assumptions we use for its proof are 

• The Markov chains are homogeneous, and unbounded (no maximum nor minimum state). 

• The Markov chains’ transition probabilities are unaffected by remapping. 

Both assumptions are used for analytic tractability. Under these assumptions, we will show that the
average value remapping policy is optimal in the sense that among all remapping schemes we could
apply at time n, the average value remapping scheme minimizes $E[T_{max}(n+d)]$ for all $d > 0$. The
use of the average remapping scheme therefore minimizes the expected duration of every future
computation step.

Our basic tool for establishing the average value policy’s optimality is the theory of stochastic
variability, alluded to in section 5. Recall that a random variable X is said to be stochastically more
variable than random variable Y, denoted $X \geq_v Y$, if $E[g(X)] \geq E[g(Y)]$ for every increasing
convex function g. This definition immediately implies that if $X \geq_v Y$, then $E[X] \geq E[Y]$.
For non-negative random variables, an equivalent definition [36] is that

$$\int_a^\infty \mathrm{Prob}\{X > t\}\, dt \geq \int_a^\infty \mathrm{Prob}\{Y > t\}\, dt \quad \text{for all } a \geq 0.$$

For our purposes, the following result, also from [36], is important: if $X_1, \ldots, X_n$ is a
group of independent random variables, $Y_1, \ldots, Y_n$ is a group of independent random variables,
and $X_i \geq_v Y_i$ for $i = 1, 2, \ldots, n$, then

$$g(X_1, \ldots, X_n) \geq_v g(Y_1, \ldots, Y_n) \qquad (6)$$

for all increasing convex functions g.

We will first show that average value remapping is optimal in a system with two chains. We
observe that if $|T_1(n) - T_2(n)| \leq 1$, then there is no benefit to be gained from remapping. We
consequently assume that if we remap at n, then $|T_1(n) - T_2(n)| > 1$. Our results stem principally
from the following lemma.

LEMMA A-1 : Let X and Y be independent, identically distributed, integer valued random variables.
If $a > b + 1$, then

$$\max\{a+X,\ b+Y\} \geq_v \max\{a-1+X,\ b+1+Y\}.$$

PROOF: Let g be any increasing convex function, and let

$$D_g(x,y) = g(\max\{a+x,\ b+y\}) - g(\max\{a-1+x,\ b+1+y\}).$$

We will argue that $E[D_g(X,Y)] \geq 0$, which will prove the lemma. We consider the value of

$$\max\{a+x,\ b+y\} - \max\{a-1+x,\ b+1+y\} \qquad (7)$$

as a function of x and y. When $a-1+x \geq b+1+y$ this difference is easily seen to be 1. When
$a-1+x < b+1+y$ and $a+x > b+y$, expression (7) is equal to $(a-b-1) + (x-y)$; furthermore,
$b-a < x-y$. But since x and y are integer valued, we have $b-a+1 \leq x-y$, so that expression (7) is
non-negative. In the cases considered, the increasing nature of g ensures that $D_g(x,y) \geq 0$.
Finally, when $a+x \leq b+y$, expression (7) equals -1. Suppose then that $X = x$, $Y = y$, and
$a-b \leq y-x$. Since X and Y are independent and identically distributed, we have

$$\mathrm{Prob}\{X = x,\ Y = y\} = \mathrm{Prob}\{X = y,\ Y = x\}.$$

Thus for any samples of $X = x$ and $Y = y$ which cause (7) to be -1, it is equally likely that
$X = y$ and $Y = x$, in which case (7) is equal to 1. Since $y > x$, we have
$\max\{a+y,\ b+x\} > \max\{a+x,\ b+y\}$. It then follows from the increasing convexity of g that
$D_g(y,x) \geq |D_g(x,y)|$. The importance of this observation is that every sample of X and Y which
causes $D_g(x,y)$ to be negative is counter-balanced by an equally probable sample of X and Y which
yields $D_g(y,x)$ with at least as large a positive magnitude. It follows then that
$E[D_g(X,Y)] \geq 0$, which proves the lemma.

□ 
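
Lemma A-1 can be spot-checked numerically for a particular distribution and a particular increasing convex function. The sketch below is an illustration, not part of the proof; the step distribution, the choice $g(t) = \max(t, 0)^2$, and the values a = 5, b = 2 are all hypothetical.

```python
from itertools import product

def expected_g_of_max(a, b, pmf, g):
    """E[g(max{a + X, b + Y})] for independent X, Y sharing pmf (value -> probability)."""
    return sum(px * py * g(max(a + x, b + y))
               for (x, px), (y, py) in product(pmf.items(), repeat=2))

pmf = {-1: 0.25, 0: 0.5, 1: 0.25}      # hypothetical single-step distribution

def g(t):
    """One increasing convex function; any other could be substituted."""
    return max(t, 0) ** 2

a, b = 5, 2                             # satisfies a > b + 1
before = expected_g_of_max(a, b, pmf, g)           # unbalanced pair
after = expected_g_of_max(a - 1, b + 1, pmf, g)    # one unit moved from a to b
```

Here `before` evaluates to 25.5 and `after` to 16.9375, so moving one unit of load from the heavier chain to the lighter one lowers the expectation, as the lemma requires for every increasing convex g.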

To apply this lemma to our problem, we note that if X is the Markov chain single step random
variable, and if $X(d)$ denotes a d-fold convolution of X, then

$$T_{max}(n+d, a, b) = \max\{a + X_1(d),\ b + X_2(d)\}$$

where $X_1(d)$ and $X_2(d)$ are independent. It follows immediately from Lemma A-1 that
$T_{max}(n+d, a, b) \geq_v T_{max}(n+d, a-1, b+1)$. We then use this result to show that average value
remapping is optimal for an N chain system.


THEOREM 3.1 : Assume that the Markov chains are homogeneous, and unbounded (no maximum nor
minimum state), and that the transition probabilities are unaffected by remapping. Suppose
$T_1(n,s_1) + T_2(n,s_2) + \cdots + T_N(n,s_N) = K$. For every $i = 1, 2, \ldots, N$, define

$$a_i = \begin{cases} \lfloor K/N \rfloor + 1 & \text{if } i \leq K \bmod N \\ \lfloor K/N \rfloor & \text{if } i > K \bmod N. \end{cases}$$

Then for every $d > 0$,

$$E[T_{max}(n+d, s_1, \ldots, s_N)] \geq E[T_{max}(n+d, a_1, \ldots, a_N)].$$
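
The average value assignment in the theorem is simple to compute. The sketch below assumes integer loads and that exactly K mod N chains receive the extra unit; the function name is hypothetical.

```python
def average_value_remap(loads):
    """Spread the total K as evenly as possible over N chains.

    K mod N chains receive floor(K/N) + 1 and the remaining chains floor(K/N).
    """
    n = len(loads)
    base, extra = divmod(sum(loads), n)
    # Index i is 0-based here, so i < extra selects the first K mod N chains.
    return [base + 1 if i < extra else base for i in range(n)]
```

For example, loads of 7, 1, 3 and 3 (K = 14, N = 4) are remapped to 4, 4, 3 and 3. The total is preserved and no two chains differ by more than one unit.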


PROOF: Without loss of generality, assume that $s_1 \geq s_2 \geq \cdots \geq s_N$. Note first that

$$T_{max}(n+d, s_1, \ldots, s_N) = \max\{ \max\{T_1(n+d, s_1),\ T_N(n+d, s_N)\},\ T_2(n+d, s_2), \ldots, T_{N-1}(n+d, s_{N-1}) \}.$$

Now

$$\max\{T_1(n+d, s_1),\ T_N(n+d, s_N)\} \geq_v \max\{T_1(n+d, s_1 - 1),\ T_N(n+d, s_N + 1)\},$$

so that by (6),

$$T_{max}(n+d, s_1, \ldots, s_N) \geq_v T_{max}(n+d, s_1 - 1, s_2, \ldots, s_{N-1}, s_N + 1).$$

This argument applies to any set of $s_1, \ldots, s_N$ such that $\sum_{j=1}^{N} s_j = K$. We may
therefore apply the argument repeatedly to find that

$$T_{max}(n+d, s_1, \ldots, s_N) \geq_v T_{max}(n+d, a_1, \ldots, a_N).$$

The theorem’s conclusion follows immediately.

□

Appendix B

In this appendix we prove Lemmas 5.1, 5.2, 5.3, and 5.4. Lemmas 5.1 and 5.2 are used in section 5
to prove Theorem 5.3. That theorem gives sufficient conditions for the difference
$E[T_{max}(i)] - E[T(i)]$ arising from the MUM model to be monotone increasing.

LEMMA 5.1 : Assume that at step 0, all chains are in state $(L+1)/2$, i.e., for all j,
$p_j((L+1)/2; 0) = 1$; also assume that $p \leq 2/3$. Then for all steps n, we have
$p_j(k-1;n) \leq p_j(k;n)$ for $k \leq (L+1)/2$, and $p_j(k+1;n) \leq p_j(k;n)$ for $k \geq (L+1)/2$.

PROOF:

We induct on n. For n = 0, the claim is trivially true since $p_j((L+1)/2; 0) = 1$. Assume that the
claim is true for n; we shall show that it is also valid for n+1. By the symmetry of the probability
mass function $p_j(k;n)$, it suffices to only prove that $p_j(k-1;n) \leq p_j(k;n)$ for
$k \leq (L+1)/2$. There are three cases to consider. Each case will employ the Chapman-Kolmogorov
difference equations [36] to describe the state probabilities at step n+1 in terms of the state
probabilities at step n.

Case 1: $2 < k < (L+1)/2$: Now

$$p_j(k;n+1) - p_j(k-1;n+1) = \frac{p}{2}\left( p_j(k-1;n) - p_j(k-2;n) \right) + (1-p)\left( p_j(k;n) - p_j(k-1;n) \right) + \frac{p}{2}\left( p_j(k+1;n) - p_j(k;n) \right) \geq 0$$

since $p_j(s;n) \geq p_j(s-1;n)$ for $1 < s \leq (L+1)/2$ by the induction hypothesis.

Case 2: $k = 2$: Noting that

$$p_j(1;n+1) = \left(1 - \frac{p}{2}\right) p_j(1;n) + \frac{p}{2}\, p_j(2;n) = \frac{p}{2}\, p_j(1;n) + (1-p)\, p_j(1;n) + \frac{p}{2}\, p_j(2;n)$$

we have

$$p_j(2;n+1) - p_j(1;n+1) = \frac{p}{2}\left( p_j(1;n) - p_j(1;n) \right) + (1-p)\left( p_j(2;n) - p_j(1;n) \right) + \frac{p}{2}\left( p_j(3;n) - p_j(2;n) \right) \geq 0$$

again by the induction hypothesis.

Case 3: $k = (L+1)/2$:

Here we must make assumptions on the allowable values of p. Utilizing the induction hypothesis to
assume that $p_j((L+1)/2 - 1; n) \geq p_j((L+1)/2 - 2; n)$, we obtain the following bound on
$p_j((L+1)/2 - 1; n+1)$:

$$p_j((L+1)/2 - 1; n+1) = \frac{p}{2}\, p_j((L+1)/2; n) + (1-p)\, p_j((L+1)/2 - 1; n) + \frac{p}{2}\, p_j((L+1)/2 - 2; n)$$

$$\leq \frac{p}{2}\, p_j((L+1)/2; n) + \left(1 - \frac{p}{2}\right) p_j((L+1)/2 - 1; n).$$

Utilizing the symmetry of the probability mass function we obtain:

$$p_j((L+1)/2; n+1) = \frac{p}{2}\, p_j((L+1)/2 + 1; n) + (1-p)\, p_j((L+1)/2; n) + \frac{p}{2}\, p_j((L+1)/2 - 1; n)$$

$$= (1-p)\, p_j((L+1)/2; n) + p\, p_j((L+1)/2 - 1; n).$$

Thus when $\frac{p}{2} \leq 1 - p$, or equivalently $p \leq \frac{2}{3}$, we obtain
$p_j((L+1)/2 - 1; n+1) \leq p_j((L+1)/2; n+1)$.

□ 
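The induction can also be checked numerically. The following is a minimal sketch, not part of the original analysis: it iterates the Chapman-Kolmogorov recursion for one chain, using the boundary behaviour of Case 2 (a chain in an end state stays there with probability $1 - p/2$), and tests whether the state probabilities stay non-decreasing up to the centre state and non-increasing beyond it. The names `step_pmf`, `is_unimodal` and `lemma_5_1_holds`, and the tolerance, are illustrative assumptions.

```python
def step_pmf(pmf, p):
    """Advance the state probabilities of one chain by one step.

    pmf[k] is the probability of state k+1 (states are 1..L, L >= 3).
    Interior states move up or down with probability p/2 each; the two
    end states move inward with probability p/2 and otherwise stay.
    """
    L = len(pmf)
    new = [0.0] * L
    for k, mass in enumerate(pmf):
        interior = 0 < k < L - 1
        new[k] += (1.0 - p if interior else 1.0 - p / 2.0) * mass
        if k > 0:
            new[k - 1] += (p / 2.0) * mass
        if k < L - 1:
            new[k + 1] += (p / 2.0) * mass
    return new


def is_unimodal(pmf, tol=1e-12):
    """True if pmf rises (weakly) to the centre state and then falls (weakly)."""
    mid = len(pmf) // 2
    rising = all(pmf[k - 1] <= pmf[k] + tol for k in range(1, mid + 1))
    falling = all(pmf[k + 1] <= pmf[k] + tol for k in range(mid, len(pmf) - 1))
    return rising and falling


def lemma_5_1_holds(L, p, steps):
    """Start the chain at the centre state (L+1)/2 and test every step."""
    pmf = [0.0] * L
    pmf[L // 2] = 1.0
    for _ in range(steps):
        pmf = step_pmf(pmf, p)
        if not is_unimodal(pmf):
            return False
    return True
```

For example, `lemma_5_1_holds(19, 0.5, 200)` passes, while `lemma_5_1_holds(3, 0.7, 1)` fails because after one step the two end states each hold more probability than the centre state.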


Note that when $p > 2/3$, Lemma 5.1's conclusion is not true for a three state chain. When
$p = 2/3 + \epsilon$ and $n=1$, it is straightforward to verify that $p_j(1;1) = 1/3 + \epsilon/2$, $p_j(2;1) = 1/3 - \epsilon$, and
$p_j(3;1) = 1/3 + \epsilon/2$.

Next we prove Lemma 5.2.

LEMMA 5.2: If for every $n$

$$\mathrm{Prob}\{T_j(n) = L\} = \min_k \mathrm{Prob}\{T_j(n) = k\},$$

then $T_j(n) \ge_v T_j(n-1)$ for all $n$.

PROOF: The $\ge_v$ order relation is discussed in Appendix A. We will simplify notation in this proof
by writing $T_j(n)$ simply as $T(n)$, suppressing the index $j$. We will demonstrate that

$$\int_a^\infty \mathrm{Prob}\{T(n) > t\}\,dt \ge \int_a^\infty \mathrm{Prob}\{T(n-1) > t\}\,dt \quad \text{for all } a \ge 0. \qquad (9)$$

We first note that this relationship is true for $0 \le a \le 1$. This follows since the case where $a = 0$
describes the means (which are equal), and $T(k) \ge 1$ for all $k$ implies that

$$\int_a^1 \mathrm{Prob}\{T(n) > t\}\,dt = \int_a^1 \mathrm{Prob}\{T(n-1) > t\}\,dt.$$

Supposing that $a > 1$, we observe that for any $t$,

$$\mathrm{Prob}\{T(n) > t\} = \mathrm{Prob}\{T(n-1) + X_n > t\} \qquad (10)$$

where $X_n$ is the random step taken by the chain between steps $n-1$ and $n$. By the theorem of total
probability,

$$\begin{aligned}
\mathrm{Prob}\{T(n-1) + X_n > t\} ={}& \mathrm{Prob}\{X_n = -1\}\,\mathrm{Prob}\{T(n-1) > t+1 \mid X_n = -1\} \\
&+ \mathrm{Prob}\{X_n = 0\}\,\mathrm{Prob}\{T(n-1) > t \mid X_n = 0\} \\
&+ \mathrm{Prob}\{X_n = 1\}\,\mathrm{Prob}\{T(n-1) > t-1 \mid X_n = 1\}.
\end{aligned}$$

We then integrate this expression with respect to $t$ from $a$ to $\infty$. By adjusting the bounds of integration
so that every inequality is expressed with respect to $t$ alone, then factoring out integrals from $a$ to $\infty$,
and finally combining those factored integrals into the term $\int_a^\infty \mathrm{Prob}\{T(n-1) > t\}\,dt$, it is shown directly
that

$$\begin{aligned}
\int_a^\infty \mathrm{Prob}\{T(n) > t\}\,dt ={}& \int_a^\infty \mathrm{Prob}\{T(n-1) > t\}\,dt \\
&+ \mathrm{Prob}\{X_n = 1\} \int_{a-1}^{a} \mathrm{Prob}\{T(n-1) > t \mid X_n = 1\}\,dt \\
&- \mathrm{Prob}\{X_n = -1\} \int_{a}^{a+1} \mathrm{Prob}\{T(n-1) > t \mid X_n = -1\}\,dt.
\end{aligned}$$


Noting that $\mathrm{Prob}\{X_n = -1\} = \mathrm{Prob}\{X_n = 1\}$, we see that relation (9) in this case is satisfied if and only
if

$$\int_{a-1}^{a} \mathrm{Prob}\{T(n-1) > t \mid X_n = 1\}\,dt \ge \int_{a}^{a+1} \mathrm{Prob}\{T(n-1) > t \mid X_n = -1\}\,dt.$$

The knowledge that $X_n = 1$ implies only that $T(n-1) \ne L$; thus

$$\mathrm{Prob}\{T(n-1) > t \mid X_n = 1\} = \mathrm{Prob}\{T(n-1) > t \mid T(n-1) \ne L\}.$$

It is then mechanical to show that

$$\mathrm{Prob}\{T(n-1) > t \mid X_n = 1\} = \frac{\mathrm{Prob}\{T(n-1) > t\} - \mathrm{Prob}\{T(n-1) = L\}}{\mathrm{Prob}\{T(n-1) \ne L\}}.$$

Since $X_n = -1$ only if $T(n-1) \ne 1$, a similar exercise demonstrates that

$$\mathrm{Prob}\{T(n-1) > t \mid X_n = -1\} = \frac{\mathrm{Prob}\{T(n-1) > t\}}{\mathrm{Prob}\{T(n-1) \ne 1\}}.$$

We then note that by symmetry of $T(n-1)$'s distribution, the denominators of these last two quotients
are equal. Thus we see that to satisfy (9), we need only show that

$$\int_{a-1}^{a} \mathrm{Prob}\{T(n-1) > t\}\,dt - \mathrm{Prob}\{T(n-1) = L\} \ge \int_{a}^{a+1} \mathrm{Prob}\{T(n-1) > t\}\,dt. \qquad (11)$$

Since $T(n-1)$ is an integer valued random variable, $\mathrm{Prob}\{T(n-1) > t\}$ is a decreasing step function of $t$,
with steps occurring at integer values of $t$. Furthermore, the change in the function's value at $t=j$ is
precisely $-\mathrm{Prob}\{T(n-1) = j\}$. Letting $j(t)$ denote the smallest integer larger than $t$, we see that the
assumption that $\mathrm{Prob}\{T(n-1) = k\}$ is minimized at $k = L$ implies that for all positive $t$,

$$\begin{aligned}
\mathrm{Prob}\{T(n-1) > t\} - \mathrm{Prob}\{T(n-1) = L\} &\ge \mathrm{Prob}\{T(n-1) > t\} - \mathrm{Prob}\{T(n-1) = j(t)\} \\
&= \mathrm{Prob}\{T(n-1) > t+1\}.
\end{aligned} \qquad (12)$$

We can now integrate both sides of (12) from $a-1$ to $a$, change variables and obtain (11). This shows
that (9) is satisfied, so that $T(n) \ge_v T(n-1)$ as claimed.

□ 
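Relation (9) lends itself to a direct numerical check, because for any random variable $\int_a^\infty \mathrm{Prob}\{T > t\}\,dt = E[(T-a)^+]$. The sketch below is an illustration only, under the same chain assumptions as the proof of Lemma 5.1 (reflecting end states, start at the centre state, $p \le 2/3$ so that state $L$ is least likely). The helper names `stop_loss` and `ordering_holds`, the threshold grid and the tolerance are assumptions made for this example.

```python
def step_pmf(pmf, p):
    """One step of the chain on states 1..L (pmf[k] is the mass of state k+1)."""
    L = len(pmf)
    new = [0.0] * L
    for k, mass in enumerate(pmf):
        interior = 0 < k < L - 1
        new[k] += (1.0 - p if interior else 1.0 - p / 2.0) * mass
        if k > 0:
            new[k - 1] += (p / 2.0) * mass
        if k < L - 1:
            new[k + 1] += (p / 2.0) * mass
    return new


def stop_loss(pmf, a):
    """E[(T - a)^+], i.e. the integral of Prob{T > t} over t >= a."""
    return sum(max((k + 1) - a, 0.0) * mass for k, mass in enumerate(pmf))


def ordering_holds(L, p, steps, grid=4, tol=1e-12):
    """Check relation (9) between consecutive steps on a grid of thresholds a."""
    pmf = [0.0] * L
    pmf[L // 2] = 1.0
    thresholds = [i / grid for i in range(L * grid + 1)]
    for _ in range(steps):
        nxt = step_pmf(pmf, p)
        if any(stop_loss(nxt, a) < stop_loss(pmf, a) - tol for a in thresholds):
            return False
        pmf = nxt
    return True
```

A grid of thresholds is only a spot check, not a proof; it is useful mainly for catching a mis-specified transition rule.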


We next provide proofs of Lemmas 5.3 and 5.4, which describe how the $\hat{n}$ minimizing $E[W(n)]$
behaves as a function of MUM model parameters. Lemma 5.3 gives $\hat{n}$ as the number of processors
approaches $\infty$. The basic idea behind Lemma 5.3's proof is that with an infinite number of processors,
the maximum state will always advance forward one state per step until the maximum state $L$ is reached. After
this point, the maximum state will always be $L$. We now substantiate these claims.

We have assumed that the Markov chains are all identical, have an odd number of states, and that
the initial state $K$ of each Markov chain is equal to $(L+1)/2$. We hence assume that $P_j(s;0) = 0$ for
$1 \le s < K$ and $P_j(s;0) = 1$ for $K \le s \le L$. Again to simplify notation, we suppress the dependence of
the cumulative distribution function $P_j(s;n)$ on $j$, since all chains are identical. Then rewriting equation
(5):

$$E[W(n)] = \frac{\sum_{m=1}^{n} \sum_{s=1}^{L} \bigl[P(s-1;m) - (P(s-1;m))^N\bigr] + C}{n} \qquad (13)$$

For $m \le L-K-1$, all chains must be in states numbered less than or equal to $K+m$, as the state
number of a chain can increase by at most one per step. Consequently for $s > K+m$, the probability
that $T(m)$ is equal to $s$ is zero, and hence $P(s;m) = 1$ for $s > K+m-1$. When $m > L-K-1$, every state
has a non-zero probability of occupancy so that $P(s;m) = 1$ only for $s = L$. Thus

$$\lim_{N\to\infty} P(s-1;m)^N = \begin{cases} 1 & \text{if } K+m+1 \le s \le L \\ 0 & \text{otherwise} \end{cases}$$

It follows that as $N \to \infty$,

$$\sum_{s=1}^{L} P^N(s-1;m) = L - (K+m), \qquad m \le L-K-1,$$

$$\sum_{s=1}^{L} P^N(s-1;m) = 0, \qquad m > L-K-1.$$

The expected state of each chain is equal to $K$ for all $n$ due to the symmetry of the Markov chains'
transition probabilities and the symmetry of the initial probability distribution. Hence
$\sum_{s=1}^{L} (1 - P(s-1;m)) = K$ for all $m$, and thus $\sum_{s=1}^{L} P(s-1;m) = L-K$. Substituting the above into (13) yields

$$E[W(n)] = \begin{cases} \dfrac{n}{2} + \dfrac{1}{2} + \dfrac{C}{n} & n \le L-K-1 \quad \text{(B.1)} \\[2ex] L - K + \dfrac{2C - (L-K-1)(L-K)}{2n} & n > L-K-1 \quad \text{(B.2)} \end{cases} \qquad (14)$$

Note that (B.1) and (B.2) take on identical values at $n = L-K-1$, and hence (14) is continuous.
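To see how the finite-$N$ expression (13) approaches the asymptotic form (14), one can evaluate (13) exactly from the chain's state probabilities. The sketch below is illustrative rather than the authors' procedure; it assumes the reflecting end-state behaviour used in the proof of Lemma 5.1, and the names `cdf`, `expected_W` and `asymptotic_W` are introduced here for the example.

```python
def step_pmf(pmf, p):
    """One step of the chain on states 1..L (pmf[k] is the mass of state k+1)."""
    L = len(pmf)
    new = [0.0] * L
    for k, mass in enumerate(pmf):
        interior = 0 < k < L - 1
        new[k] += (1.0 - p if interior else 1.0 - p / 2.0) * mass
        if k > 0:
            new[k - 1] += (p / 2.0) * mass
        if k < L - 1:
            new[k + 1] += (p / 2.0) * mass
    return new


def cdf(pmf, s):
    """P(s) = Prob{T <= s}, computed from the upper tail so that states no
    chain can have reached yet give exactly 1.0."""
    return min(1.0, max(0.0, 1.0 - sum(pmf[s:])))


def expected_W(n, N, L, p, C):
    """Equation (13) for N identical chains started at K = (L+1)/2."""
    pmf = [0.0] * L
    pmf[L // 2] = 1.0
    total = 0.0
    for _m in range(1, n + 1):
        pmf = step_pmf(pmf, p)
        for s in range(1, L + 1):
            P = cdf(pmf, s - 1)
            total += P - P ** N
    return (total + C) / n


def asymptotic_W(n, L, C):
    """Equation (14): the limit of (13) as N grows without bound."""
    K = (L + 1) // 2
    a = L - K - 1
    if n <= a:
        return n / 2 + 0.5 + C / n                      # (B.1)
    return (L - K) + (2 * C - a * (L - K)) / (2 * n)    # (B.2)
```

With $L = 19$, $p = 0.5$ and $C = 8$, calling `expected_W` with a very large integer `N` (for example `10**12`) should reproduce `asymptotic_W` closely on both branches, while small `N` gives a smaller value.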


We are now in a position to find the $\hat{n}$ that minimizes the asymptotic form of $E[W(n)]$. In order
to derive simple expressions for $\hat{n}$ we shall allow $n$ to assume any real value. The location of the
minimum $\hat{n}$, if it exists, depends on the remapping cost $C$ in a way that is stated by Lemma 5.3.

LEMMA 5.3: As $N \to \infty$,

$$\hat{n} = \begin{cases} \sqrt{2C} & \text{if } C < \dfrac{(L-K-1)^2}{2} \\[2ex] L-K-1 & \text{if } \dfrac{(L-K-1)^2}{2} \le C < \dfrac{(L-K-1)(L-K)}{2} \\[2ex] \text{no minimum} & \text{otherwise} \end{cases}$$


PROOF: We will find the minimizing values $n_1$ of (B.1) and $n_2$ of (B.2) when either exists. The continuity
of (14) is then used to make statements about the location and existence of the minimum value
of (14). Simple calculus shows that if $C$ is positive, then $n/2 + 1/2 + C/n$ has a local minimum at
$n = \sqrt{2C}$. This local minimum lies in (B.1)'s domain only if $C \le (L-K-1)^2/2$; (B.1) is otherwise
minimized at its domain endpoint $n_1 = L-K-1$. Having established $n_1$'s form, we consider the
minimization of (B.2) at $n_2$. We observe that (B.2) is a strictly increasing function of $n$ when
$C < (L-K-1)(L-K)/2$. Consequently (B.2) is minimized at its domain endpoint $n_2 = L-K-1$. For
$C \ge (L-K-1)(L-K)/2$, (B.2) is a constant or decreasing function, and hence has no minimum.

Thus when $C < (L-K-1)^2/2$, we have $n_1 = \sqrt{2C} < L-K-1$ and $n_2 = L-K-1$; it follows that $\hat{n} = \sqrt{2C}$.
For $(L-K-1)^2/2 \le C < (L-K-1)(L-K)/2$, both $n_1$ and $n_2$ equal $L-K-1$. For $C \ge (L-K-1)(L-K)/2$, (B.1)
is decreasing over $(0, L-K-1]$, and (B.2) is non-increasing over $[L-K-1, \infty)$. Recalling that (14) is
continuous, we see then that under these conditions (14) has no local minimum.
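The three regimes of Lemma 5.3 can be compared against a brute-force search over the piecewise function (14). This is a small sketch under the lemma's own assumptions ($N \to \infty$, real-valued $n$); the function names and the search grid are illustrative choices, not part of the original text.

```python
import math


def asymptotic_W(n, L, K, C):
    """Equation (14), with n allowed to be any positive real number."""
    a = L - K - 1
    if n <= a:
        return n / 2 + 0.5 + C / n                      # (B.1)
    return (L - K) + (2 * C - a * (L - K)) / (2 * n)    # (B.2)


def n_hat(L, K, C):
    """Minimizer stated by Lemma 5.3, or None when no minimum exists."""
    a = L - K - 1
    if C < a * a / 2:
        return math.sqrt(2 * C)
    if C < a * (L - K) / 2:
        return float(a)
    return None


def grid_argmin(L, K, C, lo=0.5, hi=30.0, step=0.01):
    """Brute-force minimizer of (14) on a grid, for comparison only."""
    count = int(round((hi - lo) / step))
    grid = [lo + i * step for i in range(count + 1)]
    return min(grid, key=lambda n: asymptotic_W(n, L, K, C))
```

For $L = 19$ and $K = 10$ the thresholds are $C = 32$ and $C = 36$: `n_hat(19, 10, 8.0)` gives 4.0, `n_hat(19, 10, 34.0)` gives 8.0, and `n_hat(19, 10, 40.0)` gives `None`, in which case the grid search simply runs to the right end of whatever interval it is given.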


Lemma 5.4 shows how different MUM model parameters affect the optimal static remapping frequency
$\hat{n}$. Unlike Lemma 5.3, it specifically incorporates the "activity parameter" $p$. The analysis leading
to Lemma 5.4's statement makes use of bounds on the expectation of the maximum order
statistic of a set of random variables with the same symmetric distribution. These bounds prove to be
quite tight for relatively small numbers of chains and consequently provide a useful approximation for
the expectation [14]. We also make the approximation that each Markov chain has an infinite number
of states. As before, the one step transition probability from state $s$ to state $s+1$ is $p/2$, from state $s$ to
state $s-1$ is $p/2$, and the probability of remaining in $s$ is $1-p$. This analysis takes into account the
effect of the number of chains on the form of $E[W(n)]$, and allows an exploration of the relationship
among the number of steps between remappings, the cost of remapping, the number of chains, and the
chains' activities.

LEMMA 5.4: Using an approximation given by [14], $\hat{n}$ is a function of $\dfrac{C}{N d(N)\sqrt{p}}$, where
$d(N) \sim N^{-1/2}$.


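PROOF: Since the state distribution for a chain $T(n)$ at time $n$ is symmetric, we may apply analysis in
[14] to obtain

$$\frac{E[T_{\max}(m) - T(m)]}{(\mathrm{Var}(T(m)))^{1/2}} \le \frac{N d(N)}{2}.$$

This expression is equivalent to

$$E[T_{\max}(m) - T(m)] \le \frac{N d(N)}{2}\,(\mathrm{Var}(T(m)))^{1/2}, \qquad (15)$$

where

$$d(N) = \left( \frac{2\left[1 - \binom{2N-2}{N-1}^{-1}\right]}{2N-1} \right)^{1/2}.$$

Now for any time step $n$, $E[W(n)]$ may be written as

$$E[W(n)] = \frac{\sum_{m=1}^{n} E[T_{\max}(m) - T(m)] + C}{n},$$

which may be approximated using equation (15) by

$$E[W(n)] \approx \frac{\dfrac{N d(N)}{2} \sum_{m=1}^{n} (\mathrm{Var}(T(m)))^{1/2} + C}{n}. \qquad (16)$$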


The next step is to use the properties of the MUM model to rewrite the expression for $E[W(n)]$ in
a closed form that depends on the Markov chain transition probabilities. In order to do this we must
derive an expression for $\mathrm{Var}(T(m))$. Under the assumption that a processor's state is not bounded from
above or below, $T(m)$ may be written as

$$T(m) = T(0) + \sum_{k=1}^{m} X_k$$

where for all $k$,

$$X_k = X = \begin{cases} -1 & \text{with probability } p/2 \\ 0 & \text{with probability } 1-p \\ 1 & \text{with probability } p/2 \end{cases}$$

Under the MUM model, $\mathrm{Prob}\{T(0) = K\} = 1$; i.e., we have deterministically specified $T(0)$. It follows
that

$$\mathrm{Var}(T(m)) = \sum_{k=1}^{m} \mathrm{Var}(X_k) = m\,\mathrm{Var}(X) = mp.$$
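The identity $\mathrm{Var}(T(m)) = mp$ for the unbounded chain can be confirmed by convolving the step distribution with itself $m$ times. The following is a minimal sketch; the function name `displacement_variance` is an assumption made for this example, and the steps are taken to be independent, as in the derivation above.

```python
def displacement_variance(m, p):
    """Variance of X_1 + ... + X_m, where each X_k is -1, 0 or +1 with
    probabilities p/2, 1-p and p/2, computed by exact convolution."""
    step = {-1: p / 2.0, 0: 1.0 - p, 1: p / 2.0}
    dist = {0: 1.0}                      # T(m) - T(0) starts at zero
    for _ in range(m):
        nxt = {}
        for pos, mass in dist.items():
            for dx, q in step.items():
                nxt[pos + dx] = nxt.get(pos + dx, 0.0) + mass * q
        dist = nxt
    mean = sum(x * w for x, w in dist.items())
    return sum((x - mean) ** 2 * w for x, w in dist.items())
```

For instance, `displacement_variance(10, 0.5)` agrees with $mp = 5$ up to floating-point rounding.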

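We then obtain the following approximation for $E[W(n)]$:

$$E[W(n)] \approx \frac{\dfrac{N d(N)\sqrt{p}}{2} \sum_{m=1}^{n} \sqrt{m} + C}{n}. \qquad (17)$$

Analysis entirely similar to that establishing Theorem 5.1 verifies that the expression above has at most
one local minimum. Since

$$\frac{E[W(n)]}{N d(N)\sqrt{p}} \approx \frac{1}{n}\left( \frac{1}{2}\sum_{m=1}^{n} \sqrt{m} + \frac{C}{N d(N)\sqrt{p}} \right),$$

the point $\hat{n}$ minimizing $E[W(n)]$ (if that point exists) is determined by a function $F$ of
$C / (N d(N)\sqrt{p})$:

$$\hat{n} = F\!\left( \frac{C}{N d(N)\sqrt{p}} \right).$$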

□ 
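The scaling claim of Lemma 5.4 can be illustrated by minimizing approximation (17) over integer $n$. This is a sketch under stated assumptions: the expression used for `d(N)` is the symmetric-distribution bound on the expected maximum as reconstructed here, $d(N) = \bigl(2[1 - \binom{2N-2}{N-1}^{-1}]/(2N-1)\bigr)^{1/2}$, and should be checked against [14]; the names `approx_W` and `argmin_n` and the search limit `n_max` are hypothetical.

```python
import math


def d(N):
    """Assumed form of d(N); behaves like N**-0.5 for large N."""
    return math.sqrt(2.0 * (1.0 - 1 / math.comb(2 * N - 2, N - 1)) / (2 * N - 1))


def approx_W(n, N, p, C):
    """Approximation (17) to E[W(n)] for N chains with activity p and cost C."""
    root_sum = sum(math.sqrt(m) for m in range(1, n + 1))
    return (N * d(N) * math.sqrt(p) / 2.0 * root_sum + C) / n


def argmin_n(N, p, C, n_max=200):
    """Integer n in 1..n_max minimizing approximation (17)."""
    return min(range(1, n_max + 1), key=lambda n: approx_W(n, N, p, C))
```

Dividing $p$ by four and $C$ by two leaves $C/(N d(N)\sqrt{p})$ unchanged, so `argmin_n(8, 0.5, 8.0)` and `argmin_n(8, 0.125, 4.0)` should coincide, whereas raising $C$ alone pushes the minimizer to larger $n$.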


To evaluate the degree to which the approximations made affect the predicted form of $E[W(n)]$, a
comparison between expression (17) and simulation results is depicted in figure 6. This figure portrays
results for eight chains and a variety of remapping costs. Each simulation curve depicting
$E[W(n)]$ is obtained from 100 sample paths. Equation (17) yields a good approximation for eight or
fewer chains; for larger numbers of chains the upper bound $d(N)$ becomes an increasingly poor approximation.
In tests using twenty chains, we noticed a large discrepancy between simulation results and
the values given by (17).
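A simulation of the kind described can be sketched as follows. This is not the authors' simulator: it assumes chains with reflecting end states started at $K = (L+1)/2$, measures the per-step degradation as the maximum state minus the average state across chains, and averages $(\text{accumulated degradation} + C)/n$ over sample paths. The function name `simulate_W` and the `seed` parameter are illustrative.

```python
import random


def simulate_W(n, N, L, p, C, paths, seed=0):
    """Monte Carlo estimate of E[W(n)] for N chains with L states each."""
    rng = random.Random(seed)
    K = (L + 1) // 2
    total = 0.0
    for _path in range(paths):
        states = [K] * N
        idle = 0.0
        for _step in range(n):
            for j in range(N):
                u = rng.random()
                if u < p / 2.0:              # step down, blocked at state 1
                    states[j] = max(1, states[j] - 1)
                elif u < p:                  # step up, blocked at state L
                    states[j] = min(L, states[j] + 1)
            # slowest chain minus the average chain at this step
            idle += max(states) - sum(states) / N
        total += (idle + C) / n
    return total / paths
```

Calling `simulate_W(n, 8, 19, 0.5, C, 100)` over a range of `n` gives a curve comparable in spirit to the simulated curves of figure 6; with a single chain the degradation term vanishes and the estimate reduces to $C/n$.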

Appendix C 

In this appendix we outline the binary partitioning algorithm described by [6]. When the LD
model is partitioned by this algorithm, every activity point is assigned a weight equal to the sum of its
current work units' weights. This gives rise to a matrix of weights, where each matrix entry
corresponds to an activity point. For every column $k$, we let $CL(k)$ be the sum of weights in all
columns $j < k$. Similarly, we let $CR(k)$ be the sum of weights in all columns $m > k$. The first step in
the binary dissection is to determine the column $k$ which minimizes
$\min\{|CL(k) - CR(k+1)|, |CL(k) - CR(k-1)|\}$. The matrix is split between the two columns minimizing
the magnitude of this difference. The column partitioning of a matrix is illustrated by figure 7a. Then
the same procedure is applied to the two resulting matrices, except that the sum of row weights is considered,
rather than the sum of column weights. As illustrated by figure 7b, at the completion of this
step there will be four matrices of potentially varying dimensions, such that the sums of weights in the
matrices are approximately equal. The procedure of dividing once by columns and twice by rows may
be applied recursively to each of the resultant matrices. If the ratio of matrix points to number of partitions
is small, binary dissection may yield partitions which are relatively imbalanced. We can expect
increasingly balanced partitions as this ratio increases.
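The dissection procedure can be sketched in a few lines. This is a simplified illustration and not the algorithm of [6] verbatim: it assumes each cut is placed where the absolute difference between the total weights on the two sides is smallest, alternates column cuts and row cuts, and stops after a caller-chosen number of levels. The names `best_split`, `binary_dissection`, `weight` and the `depth` parameter are hypothetical.

```python
def best_split(sums):
    """Return i in 1..len(sums)-1 minimizing |sum(sums[:i]) - sum(sums[i:])|."""
    total = sum(sums)
    left = 0
    best_i, best_diff = 1, None
    for i in range(1, len(sums)):
        left += sums[i - 1]
        diff = abs(left - (total - left))
        if best_diff is None or diff < best_diff:
            best_i, best_diff = i, diff
    return best_i


def binary_dissection(matrix, depth, by_columns=True):
    """Recursively halve a weight matrix (a list of equal-length rows).

    Cuts alternate between columns and rows; `depth` cuts are applied
    along every branch, giving up to 2**depth submatrices.
    """
    n_rows, n_cols = len(matrix), len(matrix[0])
    too_thin = n_cols < 2 if by_columns else n_rows < 2
    if depth == 0 or too_thin:
        return [matrix]
    if by_columns:
        col_sums = [sum(row[c] for row in matrix) for c in range(n_cols)]
        i = best_split(col_sums)
        halves = [[row[:i] for row in matrix], [row[i:] for row in matrix]]
    else:
        i = best_split([sum(row) for row in matrix])
        halves = [matrix[:i], matrix[i:]]
    parts = []
    for half in halves:
        parts.extend(binary_dissection(half, depth - 1, not by_columns))
    return parts


def weight(matrix):
    """Total weight of a (sub)matrix."""
    return sum(sum(row) for row in matrix)
```

With `depth=2` a uniform 4 by 4 matrix of unit weights is cut into four 2 by 2 blocks of weight 4 each; a matrix with few points per partition, such as the single row `[1, 1, 1, 5]`, can only be split into weights 3 and 5, which illustrates the imbalance noted above.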
MUM Model: Expected Average Processor Utilization as a Function of Step, Estimated from 500
Sample Paths. Each chain has 19 states, p = 0.5.

Figure 1a

MUM Model: Longterm Average Processor Idle Time per Step w(n) from Single Sample Path.
Each chain has 19 states, p = 0.5, load balancing cost = 8.

MUM Model: Expected Longterm Average Processor Idle Time per Step E[w(n)], Estimated
from 500 Sample Paths.
Each chain has 19 states, p = 0.5, load balancing cost = 8.

Figure 2b

Axis label: COST OF REMAPPING

MUM Model: Performance of Optimal Remapping Decision Policy Compared with Performance
of SAR.
Three chains, 100 steps, each chain has 19 states, p = 0.5, SAR performance
calculated from 500 sample paths.

Figure 3

Axis label: PERIODIC: STEPS BETWEEN REMAPPINGS / SAR: AVERAGE STEPS BETWEEN REMAPPINGS

MUM Model: Performance of SAR Compared with Performance of Periodic Remapping.
Eight chains, 400 steps, each chain has 19 states, p = 0.5, each data point
calculated from 200 sample paths.

Figure 4

LD Model: Performance of SAR Compared with Performance of Periodic Remapping.
64 by 64 activity array initialized with one work unit per activity point.
Work unit transition probabilities: up - 0.1, right - 0.1, down - 0.05,
left - 0.05. Each data point calculated from 50 sample points.

Figure 5

MUM Model: Analytic Approximation vs. Simulation Derived Values for Expected Longterm
Average Processor Idle Time per Step E[w(n)]. Eight chains, each chain has
19 states, p = 0.5, each simulated curve estimated from 100 sample paths.

Figure 6

Figure 7a: Column Partition

Figure 7b: Column and Row Partitions